PROJECT INTRODUCTION
In this project, using dataset public in HomeCredit competition for predictive default loan in credit risk. To understand better the background of this project, let's walk through this company profile and why they need to predict probability of default.
Home Credit is a global consumer finance provider focusing on responsible lending to underbanked populations, leveraging technology for access to financial services. In other to reduce credit loss where customers are not able to pay back the loan, one of the methods is assessing customer's probability of default before granting certain loan amount or credit limit, via authorized data on the applicant. In this scope of project, will focus on Cash Loan product, which is to predict probability of default on applicant applying for a certain loan amount.
For this problem of classification on whether or not customer will be default, supervised learning algorithm will be used with target to build main model using Logistic Regression. As requested in this scope of competition, AUC metric will be used to evaluate model's performance, with aim to achieve minimum 0.68. Additionally, AIC metric will also be used to measure model's goodness and complexity, in order to select most efficient model.
For feature selection criteria, IV and WoE will be used to measure feature's importance, as well as p-value for feature significant.
Source of data reference:
HOME CREDIT GROUP · FEATURED PREDICTION COMPETITION (2018) ·
https://www.kaggle.com/competitions/home-credit-default-risk/data
</font>
#import necessary tools
import pandas as pd
import numpy as np
pd.options.display.float_format = '{:.2f}'.format
pd.options.mode.chained_assignment = None
from sklearn.impute import SimpleImputer
import seaborn as sns
import matplotlib.pyplot as plt
#import data
path_desc = '/11. Machine Learning/11.1 Intro to ML Supervised Learning/final project tips/dataset for project/HC data/HomeCredit_columns_description.csv'
path_apptrain = '/11. Machine Learning/11.1 Intro to ML Supervised Learning/final project tips/dataset for project/HC data/application_train.csv'
path_apptest = '/11. Machine Learning/11.1 Intro to ML Supervised Learning/final project tips/dataset for project/HC data/application_test.csv'
data_desc = pd.read_csv(path_desc, encoding='cp1252')
data_apptrain = pd.read_csv(path_apptrain, encoding='cp1252')
data_apptest = pd.read_csv(path_apptest, encoding='cp1252')
data_apptrain.shape
(307511, 122)
As the original dataset contain 307511 rows and 122 columns, including both Cash Loans and Revolving Loans. In this scope of project, filter to use dataset of only cash loan product:
df = data_apptrain[data_apptrain['NAME_CONTRACT_TYPE']== 'Cash loans']
df.shape
(278232, 122)
Remaining size of Cash Loans dataset is 278232 rows x 122 collumns will be used for data exploratory and model training and assessment onwards. For cross check and identify proper method during data cleaning process, data_apptest on Cash Loans with 48305 rows x 121 cols will also be used to double check of logic, but will not participate to model training.
In this section, we will inspect data and focus on below step:
1. handling missing data;
2. handling incorrect datatype;
3. impute data inconsistency.
#check datatype of data_trainCL
set_dtype = set()
for i in range(len(df.columns)):
set_dtype.add(df.iloc[:,i].dtypes)
print(set_dtype)
{dtype('O'), dtype('int64'), dtype('float64')}
#inspecting missing values and number of values
def missing_inspection(df):
numeric_missing = {}
cat_missing = {}
for i in range(len(df.columns)):
if df.iloc[:,i].dtypes == 'int64' or df.iloc[:,i].dtypes == 'float64':
print (f'missing value in column {df.columns[i]} is {df.iloc[:,i].isnull().sum()/len(df)} \n')
if df.iloc[:,i].isnull().sum() > 0:
numeric_missing[df.columns[i]]=df.iloc[:,i].isnull().sum()
else:
print(f'number of missing values in this column is {df.iloc[:,i].isnull().sum()/len(df)} \n{df.iloc[:,i].value_counts(dropna = False)} \n')
if df.iloc[:,i].isnull().sum() > 0:
cat_missing[df.columns[i]]=df.iloc[:,i].isnull().sum()
return print(f'Summary of missing data: {len(numeric_missing.keys())+len(cat_missing.keys())} cols has missing value, in which Numeric is {len(numeric_missing.keys())} cols and Category is {len(cat_missing.keys())} cols \n \n NUMERIC: \n {numeric_missing}, \n \n CATEGORY: \n {cat_missing}')
#checking characters of missing value in data_apptest
missing_inspection(data_apptest[data_apptest['NAME_CONTRACT_TYPE']=='Cash loans'])
missing value in column SK_ID_CURR is 0.0
number of missing values in this column is 0.0
Cash loans 48305
Name: NAME_CONTRACT_TYPE, dtype: int64
number of missing values in this column is 0.0
F 32409
M 15896
Name: CODE_GENDER, dtype: int64
number of missing values in this column is 0.0
N 32033
Y 16272
Name: FLAG_OWN_CAR, dtype: int64
number of missing values in this column is 0.0
Y 33295
N 15010
Name: FLAG_OWN_REALTY, dtype: int64
missing value in column CNT_CHILDREN is 0.0
missing value in column AMT_INCOME_TOTAL is 0.0
missing value in column AMT_CREDIT is 0.0
missing value in column AMT_ANNUITY is 0.0004968429769175033
missing value in column AMT_GOODS_PRICE is 0.0
number of missing values in this column is 0.018859331332160233
Unaccompanied 39357
Family 5834
Spouse, partner 1437
NaN 911
Children 405
Other_B 206
Other_A 108
Group of people 47
Name: NAME_TYPE_SUITE, dtype: int64
number of missing values in this column is 0.0
Working 24371
Commercial associate 11246
Pensioner 9185
State servant 3501
Student 2
Name: NAME_INCOME_TYPE, dtype: int64
number of missing values in this column is 0.0
Secondary / secondary special 33747
Higher education 12330
Incomplete higher 1713
Lower secondary 474
Academic degree 41
Name: NAME_EDUCATION_TYPE, dtype: int64
number of missing values in this column is 0.0
Married 32002
Single / not married 6954
Civil marriage 4220
Separated 2938
Widow 2191
Name: NAME_FAMILY_STATUS, dtype: int64
number of missing values in this column is 0.0
House / apartment 43238
With parents 2224
Municipal apartment 1602
Rented apartment 714
Office apartment 404
Co-op apartment 123
Name: NAME_HOUSING_TYPE, dtype: int64
missing value in column REGION_POPULATION_RELATIVE is 0.0
missing value in column DAYS_BIRTH is 0.0
missing value in column DAYS_EMPLOYED is 0.0
missing value in column DAYS_REGISTRATION is 0.0
missing value in column DAYS_ID_PUBLISH is 0.0
missing value in column OWN_CAR_AGE is 0.6631611634406376
missing value in column FLAG_MOBIL is 0.0
missing value in column FLAG_EMP_PHONE is 0.0
missing value in column FLAG_WORK_PHONE is 0.0
missing value in column FLAG_CONT_MOBILE is 0.0
missing value in column FLAG_PHONE is 0.0
missing value in column FLAG_EMAIL is 0.0
number of missing values in this column is 0.3197805610185281
NaN 15447
Laborers 8597
Sales staff 5040
Core staff 4313
Managers 3519
Drivers 2749
High skill tech staff 1833
Accountants 1614
Medicine staff 1312
Security staff 909
Cooking staff 888
Cleaning staff 655
Private service staff 452
Low-skill Laborers 269
Secretaries 212
Waiters/barmen staff 177
Realty agents 136
HR staff 104
IT staff 79
Name: OCCUPATION_TYPE, dtype: int64
missing value in column CNT_FAM_MEMBERS is 0.0
missing value in column REGION_RATING_CLIENT is 0.0
missing value in column REGION_RATING_CLIENT_W_CITY is 0.0
number of missing values in this column is 0.0
TUESDAY 9661
WEDNESDAY 8371
MONDAY 8338
THURSDAY 8315
FRIDAY 7197
SATURDAY 4579
SUNDAY 1844
Name: WEEKDAY_APPR_PROCESS_START, dtype: int64
missing value in column HOUR_APPR_PROCESS_START is 0.0
missing value in column REG_REGION_NOT_LIVE_REGION is 0.0
missing value in column REG_REGION_NOT_WORK_REGION is 0.0
missing value in column LIVE_REGION_NOT_WORK_REGION is 0.0
missing value in column REG_CITY_NOT_LIVE_CITY is 0.0
missing value in column REG_CITY_NOT_WORK_CITY is 0.0
missing value in column LIVE_CITY_NOT_WORK_CITY is 0.0
number of missing values in this column is 0.0
Business Entity Type 3 10744
XNA 9185
Self-employed 5874
Other 2681
Medicine 1709
Government 1493
Business Entity Type 2 1466
Trade: type 7 1293
School 1278
Kindergarten 1035
Construction 1033
Business Entity Type 1 877
Transport: type 4 876
Trade: type 3 566
Military 525
Industry: type 9 491
Industry: type 3 485
Security 467
Transport: type 2 447
Police 438
Housing 432
Industry: type 11 415
Bank 369
Security Ministries 338
Services 298
Postal 293
Agriculture 292
Restaurant 281
Trade: type 2 230
University 218
Industry: type 7 215
Industry: type 1 176
Transport: type 3 174
Industry: type 4 166
Electricity 155
Hotel 132
Trade: type 6 118
Industry: type 5 96
Telecom 94
Emergency 90
Insurance 77
Industry: type 12 76
Industry: type 2 76
Realtor 72
Advertising 70
Trade: type 1 63
Culture 60
Legal Services 52
Mobile 45
Cleaning 41
Transport: type 1 34
Industry: type 6 27
Industry: type 10 24
Trade: type 4 14
Religion 12
Trade: type 5 9
Industry: type 13 6
Industry: type 8 2
Name: ORGANIZATION_TYPE, dtype: int64
missing value in column EXT_SOURCE_1 is 0.4209916157747645
missing value in column EXT_SOURCE_2 is 0.00014491253493427182
missing value in column EXT_SOURCE_3 is 0.17780768036435152
missing value in column APARTMENTS_AVG is 0.49065314149673944
missing value in column BASEMENTAREA_AVG is 0.5675188903840183
missing value in column YEARS_BEGINEXPLUATATION_AVG is 0.46966152572197495
missing value in column YEARS_BUILD_AVG is 0.6531621985301729
missing value in column COMMONAREA_AVG is 0.6874650657281854
missing value in column ELEVATORS_AVG is 0.5174619604595797
missing value in column ENTRANCES_AVG is 0.4844219024945658
missing value in column FLOORSMAX_AVG is 0.4791429458648173
missing value in column FLOORSMIN_AVG is 0.6665355553255357
missing value in column LANDAREA_AVG is 0.5801676845047097
missing value in column LIVINGAPARTMENTS_AVG is 0.672849601490529
missing value in column LIVINGAREA_AVG is 0.4839250595176483
missing value in column NONLIVINGAPARTMENTS_AVG is 0.6845047096573854
missing value in column NONLIVINGAREA_AVG is 0.5357623434427078
missing value in column APARTMENTS_MODE is 0.49065314149673944
missing value in column BASEMENTAREA_MODE is 0.5675188903840183
missing value in column YEARS_BEGINEXPLUATATION_MODE is 0.46966152572197495
missing value in column YEARS_BUILD_MODE is 0.6531621985301729
missing value in column COMMONAREA_MODE is 0.6874650657281854
missing value in column ELEVATORS_MODE is 0.5174619604595797
missing value in column ENTRANCES_MODE is 0.4844219024945658
missing value in column FLOORSMAX_MODE is 0.4791429458648173
missing value in column FLOORSMIN_MODE is 0.6665355553255357
missing value in column LANDAREA_MODE is 0.5801676845047097
missing value in column LIVINGAPARTMENTS_MODE is 0.672849601490529
missing value in column LIVINGAREA_MODE is 0.4839250595176483
missing value in column NONLIVINGAPARTMENTS_MODE is 0.6845047096573854
missing value in column NONLIVINGAREA_MODE is 0.5357623434427078
missing value in column APARTMENTS_MEDI is 0.49065314149673944
missing value in column BASEMENTAREA_MEDI is 0.5675188903840183
missing value in column YEARS_BEGINEXPLUATATION_MEDI is 0.46966152572197495
missing value in column YEARS_BUILD_MEDI is 0.6531621985301729
missing value in column COMMONAREA_MEDI is 0.6874650657281854
missing value in column ELEVATORS_MEDI is 0.5174619604595797
missing value in column ENTRANCES_MEDI is 0.4844219024945658
missing value in column FLOORSMAX_MEDI is 0.4791429458648173
missing value in column FLOORSMIN_MEDI is 0.6665355553255357
missing value in column LANDAREA_MEDI is 0.5801676845047097
missing value in column LIVINGAPARTMENTS_MEDI is 0.672849601490529
missing value in column LIVINGAREA_MEDI is 0.4839250595176483
missing value in column NONLIVINGAPARTMENTS_MEDI is 0.6845047096573854
missing value in column NONLIVINGAREA_MEDI is 0.5357623434427078
number of missing values in this column is 0.6730980229789877
NaN 32514
reg oper account 12001
reg oper spec account 1971
org spec account 911
not specified 908
Name: FONDKAPREMONT_MODE, dtype: int64
number of missing values in this column is 0.4852706759134665
block of flats 24404
NaN 23441
specific housing 256
terraced house 204
Name: HOUSETYPE_MODE, dtype: int64
missing value in column TOTALAREA_MODE is 0.46485871027843906
number of missing values in this column is 0.4908394576130835
NaN 23710
Panel 11147
Stone, brick 10334
Block 1418
Wooden 783
Mixed 351
Monolithic 283
Others 279
Name: WALLSMATERIAL_MODE, dtype: int64
number of missing values in this column is 0.4563709760894317
No 25906
NaN 22045
Yes 354
Name: EMERGENCYSTATE_MODE, dtype: int64
missing value in column OBS_30_CNT_SOCIAL_CIRCLE is 2.0701790704895972e-05
missing value in column DEF_30_CNT_SOCIAL_CIRCLE is 2.0701790704895972e-05
missing value in column OBS_60_CNT_SOCIAL_CIRCLE is 2.0701790704895972e-05
missing value in column DEF_60_CNT_SOCIAL_CIRCLE is 2.0701790704895972e-05
missing value in column DAYS_LAST_PHONE_CHANGE is 0.0
missing value in column FLAG_DOCUMENT_2 is 0.0
missing value in column FLAG_DOCUMENT_3 is 0.0
missing value in column FLAG_DOCUMENT_4 is 0.0
missing value in column FLAG_DOCUMENT_5 is 0.0
missing value in column FLAG_DOCUMENT_6 is 0.0
missing value in column FLAG_DOCUMENT_7 is 0.0
missing value in column FLAG_DOCUMENT_8 is 0.0
missing value in column FLAG_DOCUMENT_9 is 0.0
missing value in column FLAG_DOCUMENT_10 is 0.0
missing value in column FLAG_DOCUMENT_11 is 0.0
missing value in column FLAG_DOCUMENT_12 is 0.0
missing value in column FLAG_DOCUMENT_13 is 0.0
missing value in column FLAG_DOCUMENT_14 is 0.0
missing value in column FLAG_DOCUMENT_15 is 0.0
missing value in column FLAG_DOCUMENT_16 is 0.0
missing value in column FLAG_DOCUMENT_17 is 0.0
missing value in column FLAG_DOCUMENT_18 is 0.0
missing value in column FLAG_DOCUMENT_19 is 0.0
missing value in column FLAG_DOCUMENT_20 is 0.0
missing value in column FLAG_DOCUMENT_21 is 0.0
missing value in column AMT_REQ_CREDIT_BUREAU_HOUR is 0.12404512990373667
missing value in column AMT_REQ_CREDIT_BUREAU_DAY is 0.12404512990373667
missing value in column AMT_REQ_CREDIT_BUREAU_WEEK is 0.12404512990373667
missing value in column AMT_REQ_CREDIT_BUREAU_MON is 0.12404512990373667
missing value in column AMT_REQ_CREDIT_BUREAU_QRT is 0.12404512990373667
missing value in column AMT_REQ_CREDIT_BUREAU_YEAR is 0.12404512990373667
Summary of missing data: 64 cols has missing value, in which Numeric is 58 cols and Category is 6 cols
NUMERIC:
{'AMT_ANNUITY': 24, 'OWN_CAR_AGE': 32034, 'EXT_SOURCE_1': 20336, 'EXT_SOURCE_2': 7, 'EXT_SOURCE_3': 8589, 'APARTMENTS_AVG': 23701, 'BASEMENTAREA_AVG': 27414, 'YEARS_BEGINEXPLUATATION_AVG': 22687, 'YEARS_BUILD_AVG': 31551, 'COMMONAREA_AVG': 33208, 'ELEVATORS_AVG': 24996, 'ENTRANCES_AVG': 23400, 'FLOORSMAX_AVG': 23145, 'FLOORSMIN_AVG': 32197, 'LANDAREA_AVG': 28025, 'LIVINGAPARTMENTS_AVG': 32502, 'LIVINGAREA_AVG': 23376, 'NONLIVINGAPARTMENTS_AVG': 33065, 'NONLIVINGAREA_AVG': 25880, 'APARTMENTS_MODE': 23701, 'BASEMENTAREA_MODE': 27414, 'YEARS_BEGINEXPLUATATION_MODE': 22687, 'YEARS_BUILD_MODE': 31551, 'COMMONAREA_MODE': 33208, 'ELEVATORS_MODE': 24996, 'ENTRANCES_MODE': 23400, 'FLOORSMAX_MODE': 23145, 'FLOORSMIN_MODE': 32197, 'LANDAREA_MODE': 28025, 'LIVINGAPARTMENTS_MODE': 32502, 'LIVINGAREA_MODE': 23376, 'NONLIVINGAPARTMENTS_MODE': 33065, 'NONLIVINGAREA_MODE': 25880, 'APARTMENTS_MEDI': 23701, 'BASEMENTAREA_MEDI': 27414, 'YEARS_BEGINEXPLUATATION_MEDI': 22687, 'YEARS_BUILD_MEDI': 31551, 'COMMONAREA_MEDI': 33208, 'ELEVATORS_MEDI': 24996, 'ENTRANCES_MEDI': 23400, 'FLOORSMAX_MEDI': 23145, 'FLOORSMIN_MEDI': 32197, 'LANDAREA_MEDI': 28025, 'LIVINGAPARTMENTS_MEDI': 32502, 'LIVINGAREA_MEDI': 23376, 'NONLIVINGAPARTMENTS_MEDI': 33065, 'NONLIVINGAREA_MEDI': 25880, 'TOTALAREA_MODE': 22455, 'OBS_30_CNT_SOCIAL_CIRCLE': 1, 'DEF_30_CNT_SOCIAL_CIRCLE': 1, 'OBS_60_CNT_SOCIAL_CIRCLE': 1, 'DEF_60_CNT_SOCIAL_CIRCLE': 1, 'AMT_REQ_CREDIT_BUREAU_HOUR': 5992, 'AMT_REQ_CREDIT_BUREAU_DAY': 5992, 'AMT_REQ_CREDIT_BUREAU_WEEK': 5992, 'AMT_REQ_CREDIT_BUREAU_MON': 5992, 'AMT_REQ_CREDIT_BUREAU_QRT': 5992, 'AMT_REQ_CREDIT_BUREAU_YEAR': 5992},
CATEGORY:
{'NAME_TYPE_SUITE': 911, 'OCCUPATION_TYPE': 15447, 'FONDKAPREMONT_MODE': 32514, 'HOUSETYPE_MODE': 23441, 'WALLSMATERIAL_MODE': 23710, 'EMERGENCYSTATE_MODE': 22045}
#comparing to characters of missing value in training set
missing_inspection(df)
missing value in column SK_ID_CURR is 0.0
missing value in column TARGET is 0.0
number of missing values in this column is 0.0
Cash loans 278232
Name: NAME_CONTRACT_TYPE, dtype: int64
number of missing values in this column is 0.0
F 182800
M 95432
Name: CODE_GENDER, dtype: int64
number of missing values in this column is 0.0
N 183775
Y 94457
Name: FLAG_OWN_CAR, dtype: int64
number of missing values in this column is 0.0
Y 190207
N 88025
Name: FLAG_OWN_REALTY, dtype: int64
missing value in column CNT_CHILDREN is 0.0
missing value in column AMT_INCOME_TOTAL is 0.0
missing value in column AMT_CREDIT is 0.0
missing value in column AMT_ANNUITY is 4.312947468299836e-05
missing value in column AMT_GOODS_PRICE is 0.0
number of missing values in this column is 0.0036192817504816124
Unaccompanied 224541
Family 36687
Spouse, partner 10431
Children 2965
Other_B 1592
NaN 1007
Other_A 766
Group of people 243
Name: NAME_TYPE_SUITE, dtype: int64
number of missing values in this column is 0.0
Working 142719
Commercial associate 63652
Pensioner 51993
State servant 19836
Student 15
Unemployed 15
Maternity leave 2
Name: NAME_INCOME_TYPE, dtype: int64
number of missing values in this column is 0.0
Secondary / secondary special 200125
Higher education 65321
Incomplete higher 9032
Lower secondary 3608
Academic degree 146
Name: NAME_EDUCATION_TYPE, dtype: int64
number of missing values in this column is 0.0
Married 178711
Single / not married 39709
Civil marriage 26981
Separated 17846
Widow 14985
Name: NAME_FAMILY_STATUS, dtype: int64
number of missing values in this column is 0.0
House / apartment 247389
With parents 12981
Municipal apartment 10237
Rented apartment 4290
Office apartment 2341
Co-op apartment 994
Name: NAME_HOUSING_TYPE, dtype: int64
missing value in column REGION_POPULATION_RELATIVE is 0.0
missing value in column DAYS_BIRTH is 0.0
missing value in column DAYS_EMPLOYED is 0.0
missing value in column DAYS_REGISTRATION is 0.0
missing value in column DAYS_ID_PUBLISH is 0.0
missing value in column OWN_CAR_AGE is 0.6605279047701199
missing value in column FLAG_MOBIL is 0.0
missing value in column FLAG_EMP_PHONE is 0.0
missing value in column FLAG_WORK_PHONE is 0.0
missing value in column FLAG_CONT_MOBILE is 0.0
missing value in column FLAG_PHONE is 0.0
missing value in column FLAG_EMAIL is 0.0
number of missing values in this column is 0.31915811265418786
NaN 88800
Laborers 50131
Sales staff 28534
Core staff 24374
Managers 18739
Drivers 17248
High skill tech staff 10008
Accountants 8607
Medicine staff 7740
Security staff 6178
Cooking staff 5437
Cleaning staff 4235
Private service staff 2359
Low-skill Laborers 1930
Waiters/barmen staff 1170
Secretaries 1169
Realty agents 675
HR staff 478
IT staff 420
Name: OCCUPATION_TYPE, dtype: int64
missing value in column CNT_FAM_MEMBERS is 0.0
missing value in column REGION_RATING_CLIENT is 0.0
missing value in column REGION_RATING_CLIENT_W_CITY is 0.0
number of missing values in this column is 0.0
TUESDAY 49110
WEDNESDAY 47035
MONDAY 45954
THURSDAY 45815
FRIDAY 45383
SATURDAY 30495
SUNDAY 14440
Name: WEEKDAY_APPR_PROCESS_START, dtype: int64
missing value in column HOUR_APPR_PROCESS_START is 0.0
missing value in column REG_REGION_NOT_LIVE_REGION is 0.0
missing value in column REG_REGION_NOT_WORK_REGION is 0.0
missing value in column LIVE_REGION_NOT_WORK_REGION is 0.0
missing value in column REG_CITY_NOT_LIVE_CITY is 0.0
missing value in column REG_CITY_NOT_WORK_CITY is 0.0
missing value in column LIVE_CITY_NOT_WORK_CITY is 0.0
number of missing values in this column is 0.0
Business Entity Type 3 60755
XNA 52008
Self-employed 34518
Other 15014
Medicine 10118
Government 9469
Business Entity Type 2 9363
School 8067
Trade: type 7 7058
Kindergarten 6273
Construction 6084
Business Entity Type 1 5245
Transport: type 4 4925
Trade: type 3 3052
Industry: type 9 3022
Industry: type 3 2986
Security 2955
Housing 2715
Industry: type 11 2433
Military 2379
Agriculture 2293
Bank 2111
Police 2086
Transport: type 2 2002
Postal 1985
Security Ministries 1763
Restaurant 1610
Trade: type 2 1599
Services 1400
University 1174
Industry: type 7 1164
Transport: type 3 1081
Industry: type 1 950
Hotel 879
Electricity 843
Industry: type 4 767
Trade: type 6 551
Industry: type 5 538
Telecom 523
Insurance 513
Emergency 505
Industry: type 2 409
Advertising 368
Realtor 333
Culture 330
Industry: type 12 323
Trade: type 1 300
Mobile 268
Legal Services 259
Cleaning 236
Transport: type 1 179
Industry: type 6 97
Industry: type 10 94
Religion 79
Industry: type 13 59
Trade: type 4 54
Trade: type 5 46
Industry: type 8 22
Name: ORGANIZATION_TYPE, dtype: int64
missing value in column EXT_SOURCE_1 is 0.5679253285028322
missing value in column EXT_SOURCE_2 is 0.0021528796112596682
missing value in column EXT_SOURCE_3 is 0.19644756893527704
missing value in column APARTMENTS_AVG is 0.5118318525547025
missing value in column BASEMENTAREA_AVG is 0.5892708243480261
missing value in column YEARS_BEGINEXPLUATATION_AVG is 0.4921720003450358
missing value in column YEARS_BUILD_AVG is 0.6684745104804624
missing value in column COMMONAREA_AVG is 0.7017021766008223
missing value in column ELEVATORS_AVG is 0.5372890249863423
missing value in column ENTRANCES_AVG is 0.5079286350958911
missing value in column FLOORSMAX_AVG is 0.5019875499583082
missing value in column FLOORSMIN_AVG is 0.6818554300008626
missing value in column LANDAREA_AVG is 0.5978176485810403
missing value in column LIVINGAPARTMENTS_AVG is 0.6867039017798097
missing value in column LIVINGAREA_AVG is 0.506113603036315
missing value in column NONLIVINGAPARTMENTS_AVG is 0.697453923344547
missing value in column NONLIVINGAREA_AVG is 0.5560179993674343
missing value in column APARTMENTS_MODE is 0.5118318525547025
missing value in column BASEMENTAREA_MODE is 0.5892708243480261
missing value in column YEARS_BEGINEXPLUATATION_MODE is 0.4921720003450358
missing value in column YEARS_BUILD_MODE is 0.6684745104804624
missing value in column COMMONAREA_MODE is 0.7017021766008223
missing value in column ELEVATORS_MODE is 0.5372890249863423
missing value in column ENTRANCES_MODE is 0.5079286350958911
missing value in column FLOORSMAX_MODE is 0.5019875499583082
missing value in column FLOORSMIN_MODE is 0.6818554300008626
missing value in column LANDAREA_MODE is 0.5978176485810403
missing value in column LIVINGAPARTMENTS_MODE is 0.6867039017798097
missing value in column LIVINGAREA_MODE is 0.506113603036315
missing value in column NONLIVINGAPARTMENTS_MODE is 0.697453923344547
missing value in column NONLIVINGAREA_MODE is 0.5560179993674343
missing value in column APARTMENTS_MEDI is 0.5118318525547025
missing value in column BASEMENTAREA_MEDI is 0.5892708243480261
missing value in column YEARS_BEGINEXPLUATATION_MEDI is 0.4921720003450358
missing value in column YEARS_BUILD_MEDI is 0.6684745104804624
missing value in column COMMONAREA_MEDI is 0.7017021766008223
missing value in column ELEVATORS_MEDI is 0.5372890249863423
missing value in column ENTRANCES_MEDI is 0.5079286350958911
missing value in column FLOORSMAX_MEDI is 0.5019875499583082
missing value in column FLOORSMIN_MEDI is 0.6818554300008626
missing value in column LANDAREA_MEDI is 0.5978176485810403
missing value in column LIVINGAPARTMENTS_MEDI is 0.6867039017798097
missing value in column LIVINGAREA_MEDI is 0.506113603036315
missing value in column NONLIVINGAPARTMENTS_MEDI is 0.697453923344547
missing value in column NONLIVINGAREA_MEDI is 0.5560179993674343
number of missing values in this column is 0.6872322378446764
NaN 191210
reg oper account 66104
reg oper spec account 10854
not specified 5086
org spec account 4978
Name: FONDKAPREMONT_MODE, dtype: int64
number of missing values in this column is 0.5061639207567785
NaN 140831
block of flats 134949
specific housing 1355
terraced house 1097
Name: HOUSETYPE_MODE, dtype: int64
missing value in column TOTALAREA_MODE is 0.48705396934932
number of missing values in this column is 0.5126728773110211
NaN 142642
Panel 59340
Stone, brick 58088
Block 8266
Wooden 4851
Mixed 2048
Monolithic 1537
Others 1460
Name: WALLSMATERIAL_MODE, dtype: int64
number of missing values in this column is 0.4784101037982691
No 143014
NaN 133109
Yes 2109
Name: EMERGENCYSTATE_MODE, dtype: int64
missing value in column OBS_30_CNT_SOCIAL_CIRCLE is 3.5941228902498635e-06
missing value in column DEF_30_CNT_SOCIAL_CIRCLE is 3.5941228902498635e-06
missing value in column OBS_60_CNT_SOCIAL_CIRCLE is 3.5941228902498635e-06
missing value in column DEF_60_CNT_SOCIAL_CIRCLE is 3.5941228902498635e-06
missing value in column DAYS_LAST_PHONE_CHANGE is 3.5941228902498635e-06
missing value in column FLAG_DOCUMENT_2 is 0.0
missing value in column FLAG_DOCUMENT_3 is 0.0
missing value in column FLAG_DOCUMENT_4 is 0.0
missing value in column FLAG_DOCUMENT_5 is 0.0
missing value in column FLAG_DOCUMENT_6 is 0.0
missing value in column FLAG_DOCUMENT_7 is 0.0
missing value in column FLAG_DOCUMENT_8 is 0.0
missing value in column FLAG_DOCUMENT_9 is 0.0
missing value in column FLAG_DOCUMENT_10 is 0.0
missing value in column FLAG_DOCUMENT_11 is 0.0
missing value in column FLAG_DOCUMENT_12 is 0.0
missing value in column FLAG_DOCUMENT_13 is 0.0
missing value in column FLAG_DOCUMENT_14 is 0.0
missing value in column FLAG_DOCUMENT_15 is 0.0
missing value in column FLAG_DOCUMENT_16 is 0.0
missing value in column FLAG_DOCUMENT_17 is 0.0
missing value in column FLAG_DOCUMENT_18 is 0.0
missing value in column FLAG_DOCUMENT_19 is 0.0
missing value in column FLAG_DOCUMENT_20 is 0.0
missing value in column FLAG_DOCUMENT_21 is 0.0
missing value in column AMT_REQ_CREDIT_BUREAU_HOUR is 0.13384154231001466
missing value in column AMT_REQ_CREDIT_BUREAU_DAY is 0.13384154231001466
missing value in column AMT_REQ_CREDIT_BUREAU_WEEK is 0.13384154231001466
missing value in column AMT_REQ_CREDIT_BUREAU_MON is 0.13384154231001466
missing value in column AMT_REQ_CREDIT_BUREAU_QRT is 0.13384154231001466
missing value in column AMT_REQ_CREDIT_BUREAU_YEAR is 0.13384154231001466
Summary of missing data: 65 cols has missing value, in which Numeric is 59 cols and Category is 6 cols
NUMERIC:
{'AMT_ANNUITY': 12, 'OWN_CAR_AGE': 183780, 'EXT_SOURCE_1': 158015, 'EXT_SOURCE_2': 599, 'EXT_SOURCE_3': 54658, 'APARTMENTS_AVG': 142408, 'BASEMENTAREA_AVG': 163954, 'YEARS_BEGINEXPLUATATION_AVG': 136938, 'YEARS_BUILD_AVG': 185991, 'COMMONAREA_AVG': 195236, 'ELEVATORS_AVG': 149491, 'ENTRANCES_AVG': 141322, 'FLOORSMAX_AVG': 139669, 'FLOORSMIN_AVG': 189714, 'LANDAREA_AVG': 166332, 'LIVINGAPARTMENTS_AVG': 191063, 'LIVINGAREA_AVG': 140817, 'NONLIVINGAPARTMENTS_AVG': 194054, 'NONLIVINGAREA_AVG': 154702, 'APARTMENTS_MODE': 142408, 'BASEMENTAREA_MODE': 163954, 'YEARS_BEGINEXPLUATATION_MODE': 136938, 'YEARS_BUILD_MODE': 185991, 'COMMONAREA_MODE': 195236, 'ELEVATORS_MODE': 149491, 'ENTRANCES_MODE': 141322, 'FLOORSMAX_MODE': 139669, 'FLOORSMIN_MODE': 189714, 'LANDAREA_MODE': 166332, 'LIVINGAPARTMENTS_MODE': 191063, 'LIVINGAREA_MODE': 140817, 'NONLIVINGAPARTMENTS_MODE': 194054, 'NONLIVINGAREA_MODE': 154702, 'APARTMENTS_MEDI': 142408, 'BASEMENTAREA_MEDI': 163954, 'YEARS_BEGINEXPLUATATION_MEDI': 136938, 'YEARS_BUILD_MEDI': 185991, 'COMMONAREA_MEDI': 195236, 'ELEVATORS_MEDI': 149491, 'ENTRANCES_MEDI': 141322, 'FLOORSMAX_MEDI': 139669, 'FLOORSMIN_MEDI': 189714, 'LANDAREA_MEDI': 166332, 'LIVINGAPARTMENTS_MEDI': 191063, 'LIVINGAREA_MEDI': 140817, 'NONLIVINGAPARTMENTS_MEDI': 194054, 'NONLIVINGAREA_MEDI': 154702, 'TOTALAREA_MODE': 135514, 'OBS_30_CNT_SOCIAL_CIRCLE': 1, 'DEF_30_CNT_SOCIAL_CIRCLE': 1, 'OBS_60_CNT_SOCIAL_CIRCLE': 1, 'DEF_60_CNT_SOCIAL_CIRCLE': 1, 'DAYS_LAST_PHONE_CHANGE': 1, 'AMT_REQ_CREDIT_BUREAU_HOUR': 37239, 'AMT_REQ_CREDIT_BUREAU_DAY': 37239, 'AMT_REQ_CREDIT_BUREAU_WEEK': 37239, 'AMT_REQ_CREDIT_BUREAU_MON': 37239, 'AMT_REQ_CREDIT_BUREAU_QRT': 37239, 'AMT_REQ_CREDIT_BUREAU_YEAR': 37239},
CATEGORY:
{'NAME_TYPE_SUITE': 1007, 'OCCUPATION_TYPE': 88800, 'FONDKAPREMONT_MODE': 191210, 'HOUSETYPE_MODE': 140831, 'WALLSMATERIAL_MODE': 142642, 'EMERGENCYSTATE_MODE': 133109}
Base on data description provided by source, as well as crosschecking on testset, together with above summary of missing value and data inconsistency, removing NA or imputation will process as following:
#associated on 1 record, temporary to impute by -1, later when perform data analysis, will consider to update this into proper value
df['OBS_30_CNT_SOCIAL_CIRCLE'].fillna(-1,inplace = True)
df['DEF_30_CNT_SOCIAL_CIRCLE'].fillna(-1,inplace = True)
df['OBS_60_CNT_SOCIAL_CIRCLE'].fillna(-1,inplace = True)
df['DEF_60_CNT_SOCIAL_CIRCLE'].fillna(-1,inplace = True)
#technical issue, to drop this record
df.dropna(subset=['DAYS_LAST_PHONE_CHANGE'], inplace=True)
numeric_missing = []
cat_missing = []
for i in range(len(df.columns)):
if df.iloc[:,i].dtypes == 'int64' or df.iloc[:,i].dtypes == 'float64':
if df.iloc[:,i].isnull().sum() > 0:
numeric_missing.append(df.columns[i])
else:
if df.iloc[:,i].isnull().sum() > 0:
cat_missing.append(df.columns[i])
#impute numeric value: missing value
for col in numeric_missing:
if col in data_desc[data_desc['Special'].isin(['normalized', 'normalized '])]['Row'].tolist():
df[col].fillna(-1,inplace=True)
elif col in data_desc[data_desc['Special']=='time only relative to the application']['Row'].tolist():
df[col].fillna(1,inplace = True)
#impute numeric value: inconsistent value
#days count to application time should be less than 0 (past data), thus impute to 1 to remark invalid data to same group of missing data
for col in numeric_missing:
if col in data_desc[data_desc['Special']=='time only relative to the application']['Row'].tolist() and len(df.loc[df[col] > 0]) > 0:
df[col] = df[col].apply(lambda x: x if x < 0 else 1)
let's have a look on 12 missing record of AMT_ANNUITY:
df[df['AMT_ANNUITY'].isnull()]
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47531 | 155054 | 0 | Cash loans | M | N | N | 0 | 180000.00 | 450000.00 | NaN | ... | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
| 50035 | 157917 | 0 | Cash loans | F | N | N | 0 | 94500.00 | 450000.00 | NaN | ... | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 |
| 51594 | 159744 | 0 | Cash loans | F | N | N | 0 | 202500.00 | 539100.00 | NaN | ... | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 |
| 55025 | 163757 | 0 | Cash loans | F | N | N | 0 | 162000.00 | 296280.00 | NaN | ... | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 4.00 |
| 59934 | 169487 | 0 | Cash loans | M | Y | N | 0 | 202500.00 | 360000.00 | NaN | ... | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 6.00 |
| 75873 | 187985 | 0 | Cash loans | M | Y | N | 0 | 144000.00 | 219249.00 | NaN | ... | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 11.00 |
| 89343 | 203726 | 0 | Cash loans | F | Y | N | 0 | 90000.00 | 157500.00 | NaN | ... | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 |
| 123872 | 243648 | 0 | Cash loans | F | N | Y | 0 | 202500.00 | 929088.00 | NaN | ... | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 5.00 |
| 207186 | 340147 | 0 | Cash loans | M | N | N | 0 | 171000.00 | 486000.00 | NaN | ... | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 2.00 |
| 227939 | 364022 | 0 | Cash loans | F | N | Y | 0 | 315000.00 | 628069.50 | NaN | ... | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 2.00 |
| 239329 | 377174 | 0 | Cash loans | F | N | Y | 0 | 157500.00 | 792000.00 | NaN | ... | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
| 241835 | 379997 | 0 | Cash loans | F | N | N | 0 | 315000.00 | 1483231.50 | NaN | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
12 rows × 122 columns
as Amount of Annuity is mandatory data, this missing can be technical issue during data gathering by the source, to impute by mean value.
df['AMT_ANNUITY'].fillna(df['AMT_ANNUITY'].mean(),inplace = True)
inspect data of OWN_CAR_AGE to find proper impute value
df['OWN_CAR_AGE'].describe()
count 94451.00 mean 12.02 std 11.65 min 0.00 25% 5.00 50% 9.00 75% 15.00 max 91.00 Name: OWN_CAR_AGE, dtype: float64
OWN_CAR_AGE has large missing value, presenting age or the own car, the larger of value the older of car, therefore the missing value will be denoted -1 to distinguish with 0 as new car.
df['OWN_CAR_AGE'].fillna(-1,inplace = True)
on remaining numeric data of AMT_REQ_CREDIT_BUREAU_time, missing data means there is no information, which is different from requesting data and showing 0 early request. Therefore, missing data of these fields will be denoted as -1
df['AMT_REQ_CREDIT_BUREAU_HOUR'].fillna(-1,inplace = True)
df['AMT_REQ_CREDIT_BUREAU_DAY'].fillna(-1,inplace = True)
df['AMT_REQ_CREDIT_BUREAU_WEEK'].fillna(-1,inplace = True)
df['AMT_REQ_CREDIT_BUREAU_MON'].fillna(-1,inplace = True)
df['AMT_REQ_CREDIT_BUREAU_QRT'].fillna(-1,inplace = True)
df['AMT_REQ_CREDIT_BUREAU_YEAR'].fillna(-1,inplace = True)
for cat in cat_missing:
df[cat].fillna('unknown', inplace = True)
for i in cat_missing:
print(df[i].value_counts())
print('\n')
Unaccompanied 224540 Family 36687 Spouse, partner 10431 Children 2965 Other_B 1592 unknown 1007 Other_A 766 Group of people 243 Name: NAME_TYPE_SUITE, dtype: int64 unknown 88799 Laborers 50131 Sales staff 28534 Core staff 24374 Managers 18739 Drivers 17248 High skill tech staff 10008 Accountants 8607 Medicine staff 7740 Security staff 6178 Cooking staff 5437 Cleaning staff 4235 Private service staff 2359 Low-skill Laborers 1930 Waiters/barmen staff 1170 Secretaries 1169 Realty agents 675 HR staff 478 IT staff 420 Name: OCCUPATION_TYPE, dtype: int64 unknown 191209 reg oper account 66104 reg oper spec account 10854 not specified 5086 org spec account 4978 Name: FONDKAPREMONT_MODE, dtype: int64 unknown 140831 block of flats 134948 specific housing 1355 terraced house 1097 Name: HOUSETYPE_MODE, dtype: int64 unknown 142642 Panel 59339 Stone, brick 58088 Block 8266 Wooden 4851 Mixed 2048 Monolithic 1537 Others 1460 Name: WALLSMATERIAL_MODE, dtype: int64 No 143013 unknown 133109 Yes 2109 Name: EMERGENCYSTATE_MODE, dtype: int64
#before transform:
boolean_list = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'EMERGENCYSTATE_MODE']
for b in boolean_list:
print(df[b].value_counts())
print('\n')
F 182800 M 95431 Name: CODE_GENDER, dtype: int64 N 183775 Y 94456 Name: FLAG_OWN_CAR, dtype: int64 Y 190206 N 88025 Name: FLAG_OWN_REALTY, dtype: int64 No 143013 unknown 133109 Yes 2109 Name: EMERGENCYSTATE_MODE, dtype: int64
#transform:
df['CODE_GENDER'] = df['CODE_GENDER'].map({'F':1,'M':0})
df['FLAG_OWN_CAR'] = df['FLAG_OWN_CAR'].map({'Y':1,'N':0})
df['FLAG_OWN_REALTY'] = df['FLAG_OWN_REALTY'].map({'Y':1,'N':0})
df['EMERGENCYSTATE_MODE'] = df['EMERGENCYSTATE_MODE'].map({'Yes':1,'No':0, 'unknown': -1})
#after transform:
for b in boolean_list:
print(df[b].value_counts())
print('\n')
1 182800 0 95431 Name: CODE_GENDER, dtype: int64 0 183775 1 94456 Name: FLAG_OWN_CAR, dtype: int64 1 190206 0 88025 Name: FLAG_OWN_REALTY, dtype: int64 0 143013 -1 133109 1 2109 Name: EMERGENCYSTATE_MODE, dtype: int64
missing_inspection(df)
missing value in column SK_ID_CURR is 0.0
missing value in column TARGET is 0.0
number of missing values in this column is 0.0
Cash loans 278231
Name: NAME_CONTRACT_TYPE, dtype: int64
missing value in column CODE_GENDER is 0.0
missing value in column FLAG_OWN_CAR is 0.0
missing value in column FLAG_OWN_REALTY is 0.0
missing value in column CNT_CHILDREN is 0.0
missing value in column AMT_INCOME_TOTAL is 0.0
missing value in column AMT_CREDIT is 0.0
missing value in column AMT_ANNUITY is 0.0
missing value in column AMT_GOODS_PRICE is 0.0
number of missing values in this column is 0.0
Unaccompanied 224540
Family 36687
Spouse, partner 10431
Children 2965
Other_B 1592
unknown 1007
Other_A 766
Group of people 243
Name: NAME_TYPE_SUITE, dtype: int64
number of missing values in this column is 0.0
Working 142718
Commercial associate 63652
Pensioner 51993
State servant 19836
Student 15
Unemployed 15
Maternity leave 2
Name: NAME_INCOME_TYPE, dtype: int64
number of missing values in this column is 0.0
Secondary / secondary special 200125
Higher education 65321
Incomplete higher 9031
Lower secondary 3608
Academic degree 146
Name: NAME_EDUCATION_TYPE, dtype: int64
number of missing values in this column is 0.0
Married 178710
Single / not married 39709
Civil marriage 26981
Separated 17846
Widow 14985
Name: NAME_FAMILY_STATUS, dtype: int64
number of missing values in this column is 0.0
House / apartment 247388
With parents 12981
Municipal apartment 10237
Rented apartment 4290
Office apartment 2341
Co-op apartment 994
Name: NAME_HOUSING_TYPE, dtype: int64
missing value in column REGION_POPULATION_RELATIVE is 0.0
missing value in column DAYS_BIRTH is 0.0
missing value in column DAYS_EMPLOYED is 0.0
missing value in column DAYS_REGISTRATION is 0.0
missing value in column DAYS_ID_PUBLISH is 0.0
missing value in column OWN_CAR_AGE is 0.0
missing value in column FLAG_MOBIL is 0.0
missing value in column FLAG_EMP_PHONE is 0.0
missing value in column FLAG_WORK_PHONE is 0.0
missing value in column FLAG_CONT_MOBILE is 0.0
missing value in column FLAG_PHONE is 0.0
missing value in column FLAG_EMAIL is 0.0
number of missing values in this column is 0.0
unknown 88799
Laborers 50131
Sales staff 28534
Core staff 24374
Managers 18739
Drivers 17248
High skill tech staff 10008
Accountants 8607
Medicine staff 7740
Security staff 6178
Cooking staff 5437
Cleaning staff 4235
Private service staff 2359
Low-skill Laborers 1930
Waiters/barmen staff 1170
Secretaries 1169
Realty agents 675
HR staff 478
IT staff 420
Name: OCCUPATION_TYPE, dtype: int64
missing value in column CNT_FAM_MEMBERS is 0.0
missing value in column REGION_RATING_CLIENT is 0.0
missing value in column REGION_RATING_CLIENT_W_CITY is 0.0
number of missing values in this column is 0.0
TUESDAY 49110
WEDNESDAY 47035
MONDAY 45954
THURSDAY 45815
FRIDAY 45382
SATURDAY 30495
SUNDAY 14440
Name: WEEKDAY_APPR_PROCESS_START, dtype: int64
missing value in column HOUR_APPR_PROCESS_START is 0.0
missing value in column REG_REGION_NOT_LIVE_REGION is 0.0
missing value in column REG_REGION_NOT_WORK_REGION is 0.0
missing value in column LIVE_REGION_NOT_WORK_REGION is 0.0
missing value in column REG_CITY_NOT_LIVE_CITY is 0.0
missing value in column REG_CITY_NOT_WORK_CITY is 0.0
missing value in column LIVE_CITY_NOT_WORK_CITY is 0.0
number of missing values in this column is 0.0
Business Entity Type 3 60755
XNA 52008
Self-employed 34518
Other 15014
Medicine 10118
Government 9469
Business Entity Type 2 9363
School 8067
Trade: type 7 7058
Kindergarten 6273
Construction 6084
Business Entity Type 1 5245
Transport: type 4 4925
Trade: type 3 3051
Industry: type 9 3022
Industry: type 3 2986
Security 2955
Housing 2715
Industry: type 11 2433
Military 2379
Agriculture 2293
Bank 2111
Police 2086
Transport: type 2 2002
Postal 1985
Security Ministries 1763
Restaurant 1610
Trade: type 2 1599
Services 1400
University 1174
Industry: type 7 1164
Transport: type 3 1081
Industry: type 1 950
Hotel 879
Electricity 843
Industry: type 4 767
Trade: type 6 551
Industry: type 5 538
Telecom 523
Insurance 513
Emergency 505
Industry: type 2 409
Advertising 368
Realtor 333
Culture 330
Industry: type 12 323
Trade: type 1 300
Mobile 268
Legal Services 259
Cleaning 236
Transport: type 1 179
Industry: type 6 97
Industry: type 10 94
Religion 79
Industry: type 13 59
Trade: type 4 54
Trade: type 5 46
Industry: type 8 22
Name: ORGANIZATION_TYPE, dtype: int64
missing value in column EXT_SOURCE_1 is 0.0
missing value in column EXT_SOURCE_2 is 0.0
missing value in column EXT_SOURCE_3 is 0.0
missing value in column APARTMENTS_AVG is 0.0
missing value in column BASEMENTAREA_AVG is 0.0
missing value in column YEARS_BEGINEXPLUATATION_AVG is 0.0
missing value in column YEARS_BUILD_AVG is 0.0
missing value in column COMMONAREA_AVG is 0.0
missing value in column ELEVATORS_AVG is 0.0
missing value in column ENTRANCES_AVG is 0.0
missing value in column FLOORSMAX_AVG is 0.0
missing value in column FLOORSMIN_AVG is 0.0
missing value in column LANDAREA_AVG is 0.0
missing value in column LIVINGAPARTMENTS_AVG is 0.0
missing value in column LIVINGAREA_AVG is 0.0
missing value in column NONLIVINGAPARTMENTS_AVG is 0.0
missing value in column NONLIVINGAREA_AVG is 0.0
missing value in column APARTMENTS_MODE is 0.0
missing value in column BASEMENTAREA_MODE is 0.0
missing value in column YEARS_BEGINEXPLUATATION_MODE is 0.0
missing value in column YEARS_BUILD_MODE is 0.0
missing value in column COMMONAREA_MODE is 0.0
missing value in column ELEVATORS_MODE is 0.0
missing value in column ENTRANCES_MODE is 0.0
missing value in column FLOORSMAX_MODE is 0.0
missing value in column FLOORSMIN_MODE is 0.0
missing value in column LANDAREA_MODE is 0.0
missing value in column LIVINGAPARTMENTS_MODE is 0.0
missing value in column LIVINGAREA_MODE is 0.0
missing value in column NONLIVINGAPARTMENTS_MODE is 0.0
missing value in column NONLIVINGAREA_MODE is 0.0
missing value in column APARTMENTS_MEDI is 0.0
missing value in column BASEMENTAREA_MEDI is 0.0
missing value in column YEARS_BEGINEXPLUATATION_MEDI is 0.0
missing value in column YEARS_BUILD_MEDI is 0.0
missing value in column COMMONAREA_MEDI is 0.0
missing value in column ELEVATORS_MEDI is 0.0
missing value in column ENTRANCES_MEDI is 0.0
missing value in column FLOORSMAX_MEDI is 0.0
missing value in column FLOORSMIN_MEDI is 0.0
missing value in column LANDAREA_MEDI is 0.0
missing value in column LIVINGAPARTMENTS_MEDI is 0.0
missing value in column LIVINGAREA_MEDI is 0.0
missing value in column NONLIVINGAPARTMENTS_MEDI is 0.0
missing value in column NONLIVINGAREA_MEDI is 0.0
number of missing values in this column is 0.0
unknown 191209
reg oper account 66104
reg oper spec account 10854
not specified 5086
org spec account 4978
Name: FONDKAPREMONT_MODE, dtype: int64
number of missing values in this column is 0.0
unknown 140831
block of flats 134948
specific housing 1355
terraced house 1097
Name: HOUSETYPE_MODE, dtype: int64
missing value in column TOTALAREA_MODE is 0.0
number of missing values in this column is 0.0
unknown 142642
Panel 59339
Stone, brick 58088
Block 8266
Wooden 4851
Mixed 2048
Monolithic 1537
Others 1460
Name: WALLSMATERIAL_MODE, dtype: int64
missing value in column EMERGENCYSTATE_MODE is 0.0
missing value in column OBS_30_CNT_SOCIAL_CIRCLE is 0.0
missing value in column DEF_30_CNT_SOCIAL_CIRCLE is 0.0
missing value in column OBS_60_CNT_SOCIAL_CIRCLE is 0.0
missing value in column DEF_60_CNT_SOCIAL_CIRCLE is 0.0
missing value in column DAYS_LAST_PHONE_CHANGE is 0.0
missing value in column FLAG_DOCUMENT_2 is 0.0
missing value in column FLAG_DOCUMENT_3 is 0.0
missing value in column FLAG_DOCUMENT_4 is 0.0
missing value in column FLAG_DOCUMENT_5 is 0.0
missing value in column FLAG_DOCUMENT_6 is 0.0
missing value in column FLAG_DOCUMENT_7 is 0.0
missing value in column FLAG_DOCUMENT_8 is 0.0
missing value in column FLAG_DOCUMENT_9 is 0.0
missing value in column FLAG_DOCUMENT_10 is 0.0
missing value in column FLAG_DOCUMENT_11 is 0.0
missing value in column FLAG_DOCUMENT_12 is 0.0
missing value in column FLAG_DOCUMENT_13 is 0.0
missing value in column FLAG_DOCUMENT_14 is 0.0
missing value in column FLAG_DOCUMENT_15 is 0.0
missing value in column FLAG_DOCUMENT_16 is 0.0
missing value in column FLAG_DOCUMENT_17 is 0.0
missing value in column FLAG_DOCUMENT_18 is 0.0
missing value in column FLAG_DOCUMENT_19 is 0.0
missing value in column FLAG_DOCUMENT_20 is 0.0
missing value in column FLAG_DOCUMENT_21 is 0.0
missing value in column AMT_REQ_CREDIT_BUREAU_HOUR is 0.0
missing value in column AMT_REQ_CREDIT_BUREAU_DAY is 0.0
missing value in column AMT_REQ_CREDIT_BUREAU_WEEK is 0.0
missing value in column AMT_REQ_CREDIT_BUREAU_MON is 0.0
missing value in column AMT_REQ_CREDIT_BUREAU_QRT is 0.0
missing value in column AMT_REQ_CREDIT_BUREAU_YEAR is 0.0
Summary of missing data: 0 cols has missing value, in which Numeric is 0 cols and Category is 0 cols
NUMERIC:
{},
CATEGORY:
{}
all clean, now let's move to more interesting part: Data statistic and visualization. After this step, single data will be well understood and proper preparation for data transformation.
In this section, we will explore on statistical aspects of data, generally cover below step:
1. Data Visualization;
2. Summary Statistics;
3. Correlation Analysis;
4. Distribution Analysis.
Beforehand, let's build some function that will be used repeatedly:
#create neccessary plot for data visualization
# mainly use bar plot, histplot and plot_bar_target (when need to explore default rate association)
def plot_bar(df,column):
plt.figure(figsize=(10,6))
ax = sns.countplot(data=df, x=df[column],\
palette=[sns.color_palette("pastel")[0]],order=df[column].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.title('distribution of ' + column)
plt.show()
plt.close()
def plot_hist(df, column):
plt.figure(figsize=(10, 6)) # Set the figure size
# Plot the histogram using Seaborn
sns.histplot(df[column], color="skyblue", bins=30)
# Customize the plot
plt.title('Distribution of ' + column) # Add title
plt.xlabel(column) # Add x-axis label
plt.ylabel('Frequency') # Add y-axis label
plt.grid(axis='y', linestyle='--', alpha=0.7) # Add grid for y-axis
plt.show()
plt.close()
def plot_bar_target(df, column):
df_data = df.groupby(column)['TARGET'].value_counts(normalize=True).unstack() * 100
# Reset index to convert 'column' from index to column
df_data = df_data.reset_index()
# Create a bar plot using Seaborn
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x=column, y=0, color='skyblue', label='good', data=df_data)
sns.barplot(x=column, y=1, color='lightcoral', label='bad', data=df_data)
# Customize the plot
plt.xlabel(column)
plt.ylabel('%')
plt.xticks(rotation=90)
plt.title(column + ' by default rate')
plt.legend(title='label')
# Annotate the bars with percentages
for p in ax.patches:
height = p.get_height()
ax.annotate(f'{height:.1f}%', (p.get_x() + p.get_width() / 2., height),
ha='center', va='bottom', fontsize=10, color='black')
plt.tight_layout()
plt.show()
plt.close()
cat_list = []
num_list = []
for i in df.columns.to_list():
if (df[i].dtypes == 'int64' or df[i].dtypes == 'float64') and len(df[i].value_counts()) >= 20:
num_list.append(i)
else:
cat_list.append(i)
firstly let's have a look on category data
for i in cat_list[2:len(cat_list)]:
plot_bar(df,i)
We can observe certain group has higher proportion in this dataset, let's have a look on how they associated with default rate
for i in cat_list[2:len(cat_list)]:
plot_bar_target(df,i)
from above visualization, we can detect some following highights:
#ORGANIZATION regrouping into smaller group at df['ORGANIZATION_TYPE_TRANSF']
def map_category_to_group(category):
if category in ['Business Entity Type 1',
'Business Entity Type 2',
'Business Entity Type 3']:
return 'Business Entity'
elif category in ['Industry: type 1',
'Industry: type 10',
'Industry: type 11',
'Industry: type 12',
'Industry: type 13',
'Industry: type 2',
'Industry: type 3',
'Industry: type 4',
'Industry: type 5',
'Industry: type 6',
'Industry: type 7',
'Industry: type 8',
'Industry: type 9']:
return 'Industry'
elif category in ['Trade: type 1',
'Trade: type 2',
'Trade: type 3',
'Trade: type 4',
'Trade: type 5',
'Trade: type 6',
'Trade: type 7']:
return 'Trade'
elif category in ['Transport: type 1',
'Transport: type 2',
'Transport: type 3',
'Transport: type 4']:
return 'Transport'
elif category in ['Bank','Insurance' ]:
return 'Financial Service'
elif category in ['Military','Police','Security Ministries', 'Government' ]:
return 'Gov Officer'
elif category in ['Hotel','Restaurant','Services', 'Cleaning']:
return 'Hospitality Industry'
elif category in ['Housing','Realtor']:
return 'Real Estate'
elif category in ['Telecom','Mobile', 'Postal']:
return 'Telcomunicate'
else:
return category
df['ORGANIZATION_TYPE_TRANSF'] = df['ORGANIZATION_TYPE'].apply(map_category_to_group)
#grouping DOCUMENT_TYPE_x into FLAG_EXTRA_DOCUMENT
flag_doc = [col for col in df.columns if 'FLAG_DOCUMENT_' in col] #update documents list to look on default rate
flag_doc.remove('FLAG_DOCUMENT_3') #most of customer presenting this document, therefore we don't want to group it up.
#if customer provide one of document (value = 1), to classify value 1 into FLAG_DOCUMENT_EXTRA
df['FLAG_DOCUMENT_EXTRA'] = df[flag_doc].max(axis=1)
plot_bar(df, 'FLAG_DOCUMENT_EXTRA')
plot_bar_target(df, 'FLAG_DOCUMENT_EXTRA')
for i in num_list[1:len(num_list)]:
plot_hist(df, i)
we can observe following highlights with respective strategy:
After these refining steps, will have a look on association of grouping data vs. default rate.
df[df['DAYS_EMPLOYED'] > 0]['DAYS_EMPLOYED'].describe()
count 52008.00 mean 365243.00 std 0.00 min 365243.00 25% 365243.00 50% 365243.00 75% 365243.00 max 365243.00 Name: DAYS_EMPLOYED, dtype: float64
As we observe there is only 1 value repeated on 52008 records, to update it to invalid group, denote by value 1.
df['DAYS_EMPLOYED'] = df['DAYS_EMPLOYED'].apply(lambda x: 1 if x > 0 else x)
Next, let's have a look on OWN_CAR_AGE
df[df['OWN_CAR_AGE'] > 60]['OWN_CAR_AGE'].describe()
count 2766.00 mean 64.30 std 0.69 min 63.00 25% 64.00 50% 64.00 75% 65.00 max 91.00 Name: OWN_CAR_AGE, dtype: float64
We have 2766 records on different values, therefore let's group them into 1 group.
bins, bin_edges = pd.qcut(df['OWN_CAR_AGE'], q=10, duplicates='drop',retbins=True)
bin_edges = [-1,0,4,8,14,60,float('+inf')]
df['OWN_CAR_AGE_BIN'] = pd.cut(df['OWN_CAR_AGE'],bins = bin_edges, right=False)
plot_bar(df,'OWN_CAR_AGE_BIN')
plot_bar_target(df,'OWN_CAR_AGE_BIN')
from above plot, we can observe that customer has newer car, from 0 to 8 years, has lower default rate.
#normalize data binning
for col in num_list[1:len(num_list)]:
if col in data_desc[data_desc['Special'].isin(['normalized', 'normalized '])]['Row'].tolist():
bin_edges = [-1,0,0.25,0.5,0.75,float('+inf')]
new_col = col + '_BIN'
df[new_col] = pd.cut(df[col],bins = bin_edges, right=False)
#days data binning
for col in num_list[1:len(num_list)]:
if col in data_desc[data_desc['Special']=='time only relative to the application']['Row'].tolist() and len(df[df[col]==1]) > 0:
bins, bin_edges = pd.qcut(df[col], q=10, duplicates='drop',retbins=True)
bin_edges[0] = float('-inf')
bin_edges[-2] = 1
bin_edges[-1] = float('+inf')
new_col = col + '_BIN'
df[new_col] = pd.cut(df[col],bins = bin_edges, right=False)
elif col in data_desc[data_desc['Special']=='time only relative to the application']['Row'].tolist() and len(df[df[col]==1]) == 0:
bins, bin_edges = pd.qcut(df[col], q=10, duplicates='drop',retbins=True)
bin_edges[0] = float('-inf')
bin_edges[-1] = float('+inf')
new_col = col + '_BIN'
df[new_col] = pd.cut(df[col],bins = bin_edges, right=False)
#remaning continous data binning
for col in num_list[1:len(num_list)]:
if col not in (data_desc[data_desc['Special'].isin(['normalized', 'normalized '])]['Row'].tolist() + data_desc[data_desc['Special']=='time only relative to the application']['Row'].tolist()):
bins, bin_edges = pd.qcut(df[col], q=10, duplicates='drop', retbins=True)
bin_edges[0] = float('-inf')
bin_edges[-1] = float('+inf')
new_col = col + '_BIN'
df[new_col] = pd.cut(df[col],bins = bin_edges, right=False)
#exceptional binning on REGION_POPULATION_RELATIVE due to data skew:
bins, bin_edges = pd.qcut(df['REGION_POPULATION_RELATIVE'], q=5, duplicates='drop', retbins=True)
bin_edges[0] = float('-inf')
df['REGION_POPULATION_RELATIVE_BIN'] = pd.cut(df['REGION_POPULATION_RELATIVE'],bins = bin_edges, right=False)
df['REGION_POPULATION_RELATIVE_BIN'].value_counts().sort_index(axis = 0)
bin_list = [col for col in df.columns if '_BIN' in col] #update bin list to look on default rate
Now let's have a look on the list of new bin to see association with default rate
for col in bin_list:
plot_bar(df, col)
for col in bin_list:
plot_bar_target(df,col)
base on above plot, we can observe these highlights:
Now we finish on data analysis, let's use these groups and update to calculate IV and WoE in order to select meaningful feature for model.
First, let's have a quick look on heatmap correlation to identify potential issue on existing feature list.
#once finalize the list after calculate WoE, to check correlation of selected features
corr_matrix = df[num_list[2:len(num_list)]].corr()
plt.figure(figsize=(15, 13))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', vmin=-1, vmax=1)
plt.xticks(fontsize=7)
plt.yticks(fontsize=7)
plt.title('Correlation Heatmap', fontsize=10)
plt.show()
plt.close()
during data inspection as well as above roughcheck on correlation heatmap, we can notice that there are potential multicollinearity, especially on group of normalize data. To confirm this hypothesis, let's calculate VIF on list of single numeric variable.
from statsmodels.stats.outliers_influence import variance_inflation_factor
np.seterr(divide='ignore', invalid='ignore') #ignore error
# Calculate VIF for each predictor variable, except group of DOCUMENT have been merged into FLAG_DOCUMENT_EXTRA
col = df.columns[3:95].tolist()
col.extend(['FLAG_DOCUMENT_EXTRA', 'FLAG_DOCUMENT_3'])
col.extend(df.columns[117:122].tolist())
vif_data = df[col].select_dtypes(include=np.number).copy() # Ensure to work with numeric data only
for i in df.columns[117:122]:
vif_data[i] = df[i].copy()
vif_data['Intercept'] = 1
vif = pd.DataFrame()
vif['VAR'] = vif_data.columns
vif['VIF'] = [variance_inflation_factor(vif_data.values, i) for i in range(vif_data.shape[1])]
len(vif[vif['VIF'] >= 10]) #check number of inflation feature
50
base on above result, we can see there are 50 numeric variables has VIF >= 10, confirm that there are multicollinearity in existing list features. To solve this issue, we will remove these high inflation variables from selected features. next step, let's have a quick look on correlation of remaining features.
corr_matrix = df[vif[vif['VIF'] < 10]['VAR'].tolist()[0:-1]].corr()
plt.figure(figsize=(15, 13))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', vmin=-1, vmax=1)
plt.xticks(fontsize=7)
plt.yticks(fontsize=7)
plt.title('Correlation Heatmap', fontsize=10)
plt.show()
plt.close()
We can observe more reasonable level of correlation amongst features. Next step, let's calculate WoE and IV on categorical data, including _BIN of selected continuous data.
remove_cat = []
for i in cat_list:
if len(df[i].value_counts()) <= 1:
remove_cat.append(i)
refine_cat_list = [el for el in cat_list if el not in remove_cat]
selected_list=[]
for i in vif[vif['VIF'] < 10]['VAR'].tolist()[0:-1]:
bin_var = i+'_BIN'
if bin_var in bin_list:
selected_list.append(bin_var)
else:
selected_list.append(i)
selected_list = selected_list + refine_cat_list
selected_list.append('ORGANIZATION_TYPE_TRANSF')
set_list = set()
for s in selected_list:
set_list.add(s)
doc_list = set(['FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_12',
'FLAG_DOCUMENT_13',
'FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15',
'FLAG_DOCUMENT_16',
'FLAG_DOCUMENT_17',
'FLAG_DOCUMENT_18',
'FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_2',
'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21',
'FLAG_DOCUMENT_4',
'FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_6',
'FLAG_DOCUMENT_8',
'FLAG_DOCUMENT_9',
'ORGANIZATION_TYPE'])
selected_list = [x for x in set_list if x not in doc_list]
selected_list.sort()
len(selected_list)
54
# Function to calculate WoE and IV for a single feature
def calculate_iv(df, feature, target):
# Create a summary DataFrame to calculate IV
summary = df.groupby(feature)[target].agg(['count', 'sum'])
summary.columns = ['#total', '#bad']
summary['#good'] = summary['#total'] - summary['#bad']
summary['%bad'] = summary['#bad'] / summary['#bad'].sum()
summary['%good'] = summary['#good'] / summary['#good'].sum()
summary['woe'] = np.log(summary['%good'] / summary['%bad'])
summary['iv'] = (summary['%good'] - summary['%bad']) * summary['woe']
iv = summary['iv'].sum()
return iv, summary
# Calculate IV for each feature in the DataFrame
iv_values = {}
summaries = {}
for feature in selected_list:
if feature != 'TARGET': # Exclude the target variable
iv, summary = calculate_iv(df, feature, 'TARGET')
iv_values[feature] = iv
summaries[feature] = summary
# Print IV values for each feature
for feature, iv in iv_values.items():
print(f"IV for {feature}: {iv:.4f}")
# Optionally, print summaries for each feature
for feature, summary in summaries.items():
print(f"\nSummary for {feature}:")
print(summary)
IV for AMT_ANNUITY_BIN: 0.0285
IV for AMT_INCOME_TOTAL_BIN: 0.0091
IV for AMT_REQ_CREDIT_BUREAU_DAY: inf
IV for AMT_REQ_CREDIT_BUREAU_HOUR: inf
IV for AMT_REQ_CREDIT_BUREAU_MON_BIN: 0.0138
IV for AMT_REQ_CREDIT_BUREAU_QRT: inf
IV for AMT_REQ_CREDIT_BUREAU_WEEK: inf
IV for AMT_REQ_CREDIT_BUREAU_YEAR_BIN: 0.0174
IV for CNT_CHILDREN: inf
IV for CNT_FAM_MEMBERS: inf
IV for CODE_GENDER: 0.0425
IV for DAYS_BIRTH_BIN: 0.0905
IV for DAYS_EMPLOYED_BIN: 0.1157
IV for DAYS_ID_PUBLISH_BIN: 0.0379
IV for DAYS_REGISTRATION_BIN: 0.0281
IV for DEF_30_CNT_SOCIAL_CIRCLE: inf
IV for DEF_60_CNT_SOCIAL_CIRCLE: inf
IV for EMERGENCYSTATE_MODE: 0.0220
IV for EXT_SOURCE_1_BIN: 0.1360
IV for EXT_SOURCE_2_BIN: 0.2616
IV for EXT_SOURCE_3_BIN: 0.3044
IV for FLAG_CONT_MOBILE: 0.0000
IV for FLAG_DOCUMENT_3: 0.0163
IV for FLAG_DOCUMENT_EXTRA: 0.0221
IV for FLAG_EMAIL: 0.0000
IV for FLAG_EMP_PHONE: 0.0374
IV for FLAG_MOBIL: 0.0000
IV for FLAG_OWN_CAR: 0.0061
IV for FLAG_OWN_REALTY: 0.0003
IV for FLAG_PHONE: 0.0094
IV for FLAG_WORK_PHONE: 0.0087
IV for FONDKAPREMONT_MODE: 0.0117
IV for HOUR_APPR_PROCESS_START_BIN: 0.0089
IV for HOUSETYPE_MODE: 0.0207
IV for LIVE_CITY_NOT_WORK_CITY: 0.0137
IV for LIVE_REGION_NOT_WORK_REGION: 0.0002
IV for NAME_EDUCATION_TYPE: 0.0444
IV for NAME_FAMILY_STATUS: 0.0227
IV for NAME_HOUSING_TYPE: 0.0159
IV for NAME_INCOME_TYPE: inf
IV for NAME_TYPE_SUITE: 0.0026
IV for OCCUPATION_TYPE: 0.0820
IV for ORGANIZATION_TYPE_TRANSF: 0.0665
IV for OWN_CAR_AGE_BIN: 0.0188
IV for REGION_POPULATION_RELATIVE_BIN: 0.0146
IV for REGION_RATING_CLIENT: 0.0465
IV for REGION_RATING_CLIENT_W_CITY: 0.0492
IV for REG_CITY_NOT_LIVE_CITY: 0.0228
IV for REG_CITY_NOT_WORK_CITY: 0.0331
IV for REG_REGION_NOT_LIVE_REGION: 0.0005
IV for REG_REGION_NOT_WORK_REGION: 0.0009
IV for WALLSMATERIAL_MODE: 0.0254
IV for WEEKDAY_APPR_PROCESS_START: 0.0008
Summary for AMT_ANNUITY_BIN:
#total #bad #good %bad %good woe iv
AMT_ANNUITY_BIN
[-inf, 12672.0) 27804 1908 25896 0.08 0.10 0.21 0.00
[12672.0, 16456.5) 27497 2129 25368 0.09 0.10 0.08 0.00
[16456.5, 19975.5) 28167 2606 25561 0.11 0.10 -0.11 0.00
[19975.5, 23089.5) 27751 2623 25128 0.11 0.10 -0.14 0.00
[23089.5, 26086.5) 27588 2322 25266 0.10 0.10 -0.01 0.00
[26086.5, 29299.5) 28116 2623 25493 0.11 0.10 -0.12 0.00
[29299.5, 32895.0) 27449 2811 24638 0.12 0.10 -0.23 0.01
[32895.0, 38236.5) 28208 2412 25796 0.10 0.10 -0.03 0.00
[38236.5, 46786.5) 27821 2165 25656 0.09 0.10 0.08 0.00
[46786.5, inf) 27830 1622 26208 0.07 0.10 0.39 0.01
Summary for AMT_INCOME_TOTAL_BIN:
#total #bad #good %bad %good woe iv
AMT_INCOME_TOTAL_BIN
[-inf, 81000.0) 24252 1962 22290 0.08 0.09 0.03 0.00
[81000.0, 99000.0) 28612 2466 26146 0.11 0.10 -0.04 0.00
[99000.0, 112500.0) 8634 744 7890 0.03 0.03 -0.03 0.00
[112500.0, 135000.0) 39545 3563 35982 0.15 0.14 -0.08 0.00
[135000.0, 153000.0) 37561 3284 34277 0.14 0.13 -0.05 0.00
[153000.0, 166500.0) 27134 2427 24707 0.10 0.10 -0.08 0.00
[166500.0, 189000.0) 28774 2521 26253 0.11 0.10 -0.05 0.00
[189000.0, 225000.0) 21826 1807 20019 0.08 0.08 0.01 0.00
[225000.0, 270000.0) 27375 2116 25259 0.09 0.10 0.08 0.00
[270000.0, inf) 34518 2331 32187 0.10 0.13 0.23 0.01
Summary for AMT_REQ_CREDIT_BUREAU_DAY:
#total #bad #good %bad %good woe iv
AMT_REQ_CREDIT_BUREAU_DAY
-1.00 37238 3985 33253 0.17 0.13 -0.27 0.01
0.00 239620 19102 220518 0.82 0.86 0.05 0.00
1.00 1189 118 1071 0.01 0.00 -0.19 0.00
2.00 94 11 83 0.00 0.00 -0.38 0.00
3.00 45 2 43 0.00 0.00 0.67 0.00
4.00 25 3 22 0.00 0.00 -0.40 0.00
5.00 9 0 9 0.00 0.00 inf inf
6.00 8 0 8 0.00 0.00 inf inf
8.00 1 0 1 0.00 0.00 inf inf
9.00 2 0 2 0.00 0.00 inf inf
Summary for AMT_REQ_CREDIT_BUREAU_HOUR:
#total #bad #good %bad %good woe iv
AMT_REQ_CREDIT_BUREAU_HOUR
-1.00 37238 3985 33253 0.17 0.13 -0.27 0.01
0.00 239523 19114 220409 0.82 0.86 0.05 0.00
1.00 1411 116 1295 0.00 0.01 0.02 0.00
2.00 49 6 43 0.00 0.00 -0.43 0.00
3.00 9 0 9 0.00 0.00 inf inf
4.00 1 0 1 0.00 0.00 inf inf
Summary for AMT_REQ_CREDIT_BUREAU_MON_BIN:
#total #bad #good %bad %good woe iv
AMT_REQ_CREDIT_BUREAU_MON_BIN
[-inf, 0.0) 37238 3985 33253 0.17 0.13 -0.27 0.01
[0.0, 1.0) 200509 16166 184343 0.70 0.72 0.04 0.00
[1.0, inf) 40484 3070 37414 0.13 0.15 0.10 0.00
Summary for AMT_REQ_CREDIT_BUREAU_QRT:
#total #bad #good %bad %good woe iv
AMT_REQ_CREDIT_BUREAU_QRT
-1.00 37238 3985 33253 0.17 0.13 -0.27 0.01
0.00 194166 15808 178358 0.68 0.70 0.03 0.00
1.00 31452 2108 29344 0.09 0.12 0.24 0.01
2.00 13215 1136 12079 0.05 0.05 -0.03 0.00
3.00 1615 119 1496 0.01 0.01 0.14 0.00
4.00 446 52 394 0.00 0.00 -0.37 0.00
5.00 60 6 54 0.00 0.00 -0.20 0.00
6.00 25 6 19 0.00 0.00 -1.24 0.00
7.00 6 0 6 0.00 0.00 inf inf
8.00 6 0 6 0.00 0.00 inf inf
19.00 1 1 0 0.00 0.00 -inf inf
261.00 1 0 1 0.00 0.00 inf inf
Summary for AMT_REQ_CREDIT_BUREAU_WEEK:
#total #bad #good %bad %good woe iv
AMT_REQ_CREDIT_BUREAU_WEEK
-1.00 37238 3985 33253 0.17 0.13 -0.27 0.01
0.00 233047 18599 214448 0.80 0.84 0.05 0.00
1.00 7636 609 7027 0.03 0.03 0.05 0.00
2.00 186 19 167 0.00 0.00 -0.22 0.00
3.00 54 3 51 0.00 0.00 0.44 0.00
4.00 34 4 30 0.00 0.00 -0.38 0.00
5.00 10 1 9 0.00 0.00 -0.20 0.00
6.00 19 1 18 0.00 0.00 0.49 0.00
7.00 2 0 2 0.00 0.00 inf inf
8.00 5 0 5 0.00 0.00 inf inf
Summary for AMT_REQ_CREDIT_BUREAU_YEAR_BIN:
#total #bad #good %bad %good woe iv
AMT_REQ_CREDIT_BUREAU_YEAR_BIN
[-inf, 0.0) 37238 3985 33253 0.17 0.13 -0.27 0.01
[0.0, 1.0) 62941 4686 58255 0.20 0.23 0.12 0.00
[1.0, 2.0) 57505 4351 53154 0.19 0.21 0.11 0.00
[2.0, 3.0) 45984 3835 42149 0.17 0.17 0.00 0.00
[3.0, 4.0) 31084 2547 28537 0.11 0.11 0.02 0.00
[4.0, inf) 43479 3817 39662 0.16 0.16 -0.06 0.00
Summary for CNT_CHILDREN:
#total #bad #good %bad %good woe iv
CNT_CHILDREN
0 196452 15596 180856 0.67 0.71 0.05 0.00
1 54054 5048 49006 0.22 0.19 -0.12 0.00
2 23893 2177 21716 0.09 0.09 -0.10 0.00
3 3331 336 2995 0.01 0.01 -0.21 0.00
4 382 48 334 0.00 0.00 -0.46 0.00
5 79 7 72 0.00 0.00 -0.07 0.00
6 20 6 14 0.00 0.00 -1.55 0.00
7 7 0 7 0.00 0.00 inf inf
8 2 0 2 0.00 0.00 inf inf
9 2 2 0 0.00 0.00 -inf inf
10 2 0 2 0.00 0.00 inf inf
11 1 1 0 0.00 0.00 -inf inf
12 1 0 1 0.00 0.00 inf inf
14 3 0 3 0.00 0.00 inf inf
19 2 0 2 0.00 0.00 inf inf
Summary for CNT_FAM_MEMBERS:
#total #bad #good %bad %good woe iv
CNT_FAM_MEMBERS
1.00 60832 5261 55571 0.23 0.22 -0.04 0.00
2.00 145017 11320 133697 0.49 0.52 0.07 0.00
3.00 46699 4270 42429 0.18 0.17 -0.10 0.00
4.00 22086 2001 20085 0.09 0.08 -0.09 0.00
5.00 3118 306 2812 0.01 0.01 -0.18 0.00
6.00 365 48 317 0.00 0.00 -0.51 0.00
7.00 76 6 70 0.00 0.00 0.06 0.00
8.00 19 6 13 0.00 0.00 -1.62 0.00
9.00 6 0 6 0.00 0.00 inf inf
10.00 3 1 2 0.00 0.00 -1.70 0.00
11.00 1 1 0 0.00 0.00 -inf inf
12.00 2 0 2 0.00 0.00 inf inf
13.00 1 1 0 0.00 0.00 -inf inf
14.00 1 0 1 0.00 0.00 inf inf
15.00 1 0 1 0.00 0.00 inf inf
16.00 2 0 2 0.00 0.00 inf inf
20.00 2 0 2 0.00 0.00 inf inf
Summary for CODE_GENDER:
#total #bad #good %bad %good woe iv
CODE_GENDER
0 95431 10094 85337 0.43 0.33 -0.26 0.03
1 182800 13127 169673 0.57 0.67 0.16 0.02
Summary for DAYS_BIRTH_BIN:
#total #bad #good %bad %good woe iv
DAYS_BIRTH_BIN
[-inf, -22265.0) 27813 1388 26425 0.06 0.10 0.55 0.02
[-22265.0, -20581.0) 27828 1533 26295 0.07 0.10 0.45 0.02
[-20581.0, -19019.0) 27825 1809 26016 0.08 0.10 0.27 0.01
[-19019.0, -17363.0) 27812 2041 25771 0.09 0.10 0.14 0.00
[-17363.0, -15874.0) 27820 2211 25609 0.10 0.10 0.05 0.00
[-15874.0, -14537.0) 27834 2287 25547 0.10 0.10 0.02 0.00
[-14537.0, -13263.0) 27829 2555 25274 0.11 0.10 -0.10 0.00
[-13263.0, -11819.0) 27822 2857 24965 0.12 0.10 -0.23 0.01
[-11819.0, -10403.0) 27810 3150 24660 0.14 0.10 -0.34 0.01
[-10403.0, inf) 27838 3390 24448 0.15 0.10 -0.42 0.02
Summary for DAYS_EMPLOYED_BIN:
#total #bad #good %bad %good woe iv
DAYS_EMPLOYED_BIN
[-inf, -4909.0) 27816 1382 26434 0.06 0.10 0.55 0.02
[-4909.0, -3252.0) 27810 1750 26060 0.08 0.10 0.30 0.01
[-3252.0, -2381.0) 27832 2077 25755 0.09 0.10 0.12 0.00
[-2381.0, -1708.0) 27818 2368 25450 0.10 0.10 -0.02 0.00
[-1708.0, -1220.0) 27799 2812 24987 0.12 0.10 -0.21 0.00
[-1220.0, -819.0) 27815 3060 24755 0.13 0.10 -0.31 0.01
[-819.0, -445.0) 27805 3293 24512 0.14 0.10 -0.39 0.02
[-445.0, 1.0) 31528 3629 27899 0.16 0.11 -0.36 0.02
[1.0, inf) 52008 2850 49158 0.12 0.19 0.45 0.03
Summary for DAYS_ID_PUBLISH_BIN:
#total #bad #good %bad %good woe iv
DAYS_ID_PUBLISH_BIN
[-inf, -4729.0) 27819 1725 26094 0.07 0.10 0.32 0.01
[-4729.0, -4439.0) 27759 1748 26011 0.08 0.10 0.30 0.01
[-4439.0, -4184.0) 27882 1895 25987 0.08 0.10 0.22 0.00
[-4184.0, -3909.0) 27782 2215 25567 0.10 0.10 0.05 0.00
[-3909.0, -3294.0) 27839 2256 25583 0.10 0.10 0.03 0.00
[-3294.0, -2696.0) 27827 2409 25418 0.10 0.10 -0.04 0.00
[-2696.0, -2091.0) 27838 2571 25267 0.11 0.10 -0.11 0.00
[-2091.0, -1424.0) 27810 2671 25139 0.12 0.10 -0.15 0.00
[-1424.0, -757.0) 27828 2803 25025 0.12 0.10 -0.21 0.00
[-757.0, inf) 27847 2928 24919 0.13 0.10 -0.25 0.01
Summary for DAYS_REGISTRATION_BIN:
#total #bad #good %bad %good woe iv
DAYS_REGISTRATION_BIN
[-inf, -9980.0) 27812 1615 26197 0.07 0.10 0.39 0.01
[-9980.0, -8269.0) 27822 1885 25937 0.08 0.10 0.23 0.00
[-8269.0, -6804.0) 27820 2078 25742 0.09 0.10 0.12 0.00
[-6804.0, -5484.0) 27813 2273 25540 0.10 0.10 0.02 0.00
[-5484.0, -4522.0) 27841 2489 25352 0.11 0.10 -0.08 0.00
[-4522.0, -3558.0) 27821 2534 25287 0.11 0.10 -0.10 0.00
[-3558.0, -2557.0) 27806 2487 25319 0.11 0.10 -0.08 0.00
[-2557.0, -1504.0) 27825 2420 25405 0.10 0.10 -0.05 0.00
[-1504.0, -698.0) 27814 2622 25192 0.11 0.10 -0.13 0.00
[-698.0, inf) 27857 2818 25039 0.12 0.10 -0.21 0.00
Summary for DEF_30_CNT_SOCIAL_CIRCLE:
#total #bad #good %bad %good woe iv
DEF_30_CNT_SOCIAL_CIRCLE
-1.00 1 0 1 0.00 0.00 inf inf
0.00 246054 19775 226279 0.85 0.89 0.04 0.00
1.00 25885 2686 23199 0.12 0.09 -0.24 0.01
2.00 4892 577 4315 0.02 0.02 -0.38 0.00
3.00 1101 142 959 0.01 0.00 -0.49 0.00
4.00 232 32 200 0.00 0.00 -0.56 0.00
5.00 54 8 46 0.00 0.00 -0.65 0.00
6.00 10 1 9 0.00 0.00 -0.20 0.00
7.00 1 0 1 0.00 0.00 inf inf
8.00 1 0 1 0.00 0.00 inf inf
Summary for DEF_60_CNT_SOCIAL_CIRCLE:
#total #bad #good %bad %good woe iv
DEF_60_CNT_SOCIAL_CIRCLE
-1.00 1 0 1 0.00 0.00 inf inf
0.00 254692 20623 234069 0.89 0.92 0.03 0.00
1.00 19908 2131 17777 0.09 0.07 -0.27 0.01
2.00 2924 357 2567 0.02 0.01 -0.42 0.00
3.00 561 93 468 0.00 0.00 -0.78 0.00
4.00 122 14 108 0.00 0.00 -0.35 0.00
5.00 19 3 16 0.00 0.00 -0.72 0.00
6.00 3 0 3 0.00 0.00 inf inf
7.00 1 0 1 0.00 0.00 inf inf
Summary for EMERGENCYSTATE_MODE:
#total #bad #good %bad %good woe iv
EMERGENCYSTATE_MODE
-1 133109 12655 120454 0.54 0.47 -0.14 0.01
0 143013 10360 132653 0.45 0.52 0.15 0.01
1 2109 206 1903 0.01 0.01 -0.17 0.00
Summary for EXT_SOURCE_1_BIN:
#total #bad #good %bad %good woe iv
EXT_SOURCE_1_BIN
[-1.0, 0.0) 158014 13841 144173 0.60 0.57 -0.05 0.00
[0.0, 0.25) 16987 2825 14162 0.12 0.06 -0.78 0.05
[0.25, 0.5) 41786 3759 38027 0.16 0.15 -0.08 0.00
[0.5, 0.75) 43943 2275 41668 0.10 0.16 0.51 0.03
[0.75, inf) 17501 521 16980 0.02 0.07 1.09 0.05
Summary for EXT_SOURCE_2_BIN:
#total #bad #good %bad %good woe iv
EXT_SOURCE_2_BIN
[-1.0, 0.0) 598 50 548 0.00 0.00 -0.00 0.00
[0.0, 0.25) 34298 6157 28141 0.27 0.11 -0.88 0.14
[0.25, 0.5) 71839 7398 64441 0.32 0.25 -0.23 0.02
[0.5, 0.75) 158629 9256 149373 0.40 0.59 0.38 0.07
[0.75, inf) 12867 360 12507 0.02 0.05 1.15 0.04
Summary for EXT_SOURCE_3_BIN:
#total #bad #good %bad %good woe iv
EXT_SOURCE_3_BIN
[-1.0, 0.0) 54657 5276 49381 0.23 0.19 -0.16 0.01
[0.0, 0.25) 26627 5258 21369 0.23 0.08 -0.99 0.14
[0.25, 0.5) 71203 6933 64270 0.30 0.25 -0.17 0.01
[0.5, 0.75) 103401 4998 98403 0.22 0.39 0.58 0.10
[0.75, inf) 22343 756 21587 0.03 0.08 0.96 0.05
Summary for FLAG_CONT_MOBILE:
#total #bad #good %bad %good woe iv
FLAG_CONT_MOBILE
0 141 13 128 0.00 0.00 -0.11 0.00
1 278090 23208 254882 1.00 1.00 0.00 0.00
Summary for FLAG_DOCUMENT_3:
#total #bad #good %bad %good woe iv
FLAG_DOCUMENT_3
0 61033 4013 57020 0.17 0.22 0.26 0.01
1 217198 19208 197990 0.83 0.78 -0.06 0.00
Summary for FLAG_DOCUMENT_EXTRA:
#total #bad #good %bad %good woe iv
FLAG_DOCUMENT_EXTRA
0 214210 19153 195057 0.82 0.76 -0.08 0.00
1 64021 4068 59953 0.18 0.24 0.29 0.02
Summary for FLAG_EMAIL:
#total #bad #good %bad %good woe iv
FLAG_EMAIL
0 262224 21909 240315 0.94 0.94 -0.00 0.00
1 16007 1312 14695 0.06 0.06 0.02 0.00
Summary for FLAG_EMP_PHONE:
#total #bad #good %bad %good woe iv
FLAG_EMP_PHONE
0 52018 2851 49167 0.12 0.19 0.45 0.03
1 226213 20370 205843 0.88 0.81 -0.08 0.01
Summary for FLAG_MOBIL:
#total #bad #good %bad %good woe iv
FLAG_MOBIL
1 278231 23221 255010 1.00 1.00 0.00 0.00
Summary for FLAG_OWN_CAR:
#total #bad #good %bad %good woe iv
FLAG_OWN_CAR
0 183775 16114 167661 0.69 0.66 -0.05 0.00
1 94456 7107 87349 0.31 0.34 0.11 0.00
Summary for FLAG_OWN_REALTY:
#total #bad #good %bad %good woe iv
FLAG_OWN_REALTY
0 88025 7520 80505 0.32 0.32 -0.03 0.00
1 190206 15701 174505 0.68 0.68 0.01 0.00
Summary for FLAG_PHONE:
#total #bad #good %bad %good woe iv
FLAG_PHONE
0 199162 17532 181630 0.76 0.71 -0.06 0.00
1 79069 5689 73380 0.24 0.29 0.16 0.01
Summary for FLAG_WORK_PHONE:
#total #bad #good %bad %good woe iv
FLAG_WORK_PHONE
0 221547 17668 203879 0.76 0.80 0.05 0.00
1 56684 5553 51131 0.24 0.20 -0.18 0.01
Summary for FONDKAPREMONT_MODE:
#total #bad #good %bad %good woe iv
FONDKAPREMONT_MODE
not specified 5086 405 4681 0.02 0.02 0.05 0.00
org spec account 4978 311 4667 0.01 0.02 0.31 0.00
reg oper account 66104 4800 61304 0.21 0.24 0.15 0.01
reg oper spec account 10854 735 10119 0.03 0.04 0.23 0.00
unknown 191209 16970 174239 0.73 0.68 -0.07 0.00
Summary for HOUR_APPR_PROCESS_START_BIN:
#total #bad #good %bad %good woe iv
HOUR_APPR_PROCESS_START_BIN
[-inf, 8.0) 20585 2150 18435 0.09 0.07 -0.25 0.01
[8.0, 9.0) 13918 1296 12622 0.06 0.05 -0.12 0.00
[9.0, 10.0) 25213 2114 23099 0.09 0.09 -0.01 0.00
[10.0, 11.0) 34526 2840 31686 0.12 0.12 0.02 0.00
[11.0, 12.0) 33841 2823 31018 0.12 0.12 0.00 0.00
[12.0, 13.0) 30772 2623 28149 0.11 0.11 -0.02 0.00
[13.0, 14.0) 27864 2316 25548 0.10 0.10 0.00 0.00
[14.0, 15.0) 24898 2041 22857 0.09 0.09 0.02 0.00
[15.0, 16.0) 22201 1734 20467 0.07 0.08 0.07 0.00
[16.0, inf) 44413 3284 41129 0.14 0.16 0.13 0.00
Summary for HOUSETYPE_MODE:
#total #bad #good %bad %good woe iv
HOUSETYPE_MODE
block of flats 134948 9745 125203 0.42 0.49 0.16 0.01
specific housing 1355 143 1212 0.01 0.00 -0.26 0.00
terraced house 1097 96 1001 0.00 0.00 -0.05 0.00
unknown 140831 13237 127594 0.57 0.50 -0.13 0.01
Summary for LIVE_CITY_NOT_WORK_CITY:
#total #bad #good %bad %good woe iv
LIVE_CITY_NOT_WORK_CITY
0 228100 18045 210055 0.78 0.82 0.06 0.00
1 50131 5176 44955 0.22 0.18 -0.23 0.01
Summary for LIVE_REGION_NOT_WORK_REGION:
#total #bad #good %bad %good woe iv
LIVE_REGION_NOT_WORK_REGION
0 267060 22223 244837 0.96 0.96 0.00 0.00
1 11171 998 10173 0.04 0.04 -0.07 0.00
Summary for NAME_EDUCATION_TYPE:
#total #bad #good %bad %good woe iv
NAME_EDUCATION_TYPE
Academic degree 146 3 143 0.00 0.00 1.47 0.00
Higher education 65321 3694 61627 0.16 0.24 0.42 0.03
Incomplete higher 9031 802 8229 0.03 0.03 -0.07 0.00
Lower secondary 3608 397 3211 0.02 0.01 -0.31 0.00
Secondary / secondary special 200125 18325 181800 0.79 0.71 -0.10 0.01
Summary for NAME_FAMILY_STATUS:
#total #bad #good %bad %good woe iv
NAME_FAMILY_STATUS
Civil marriage 26981 2764 24217 0.12 0.09 -0.23 0.01
Married 178710 13970 164740 0.60 0.65 0.07 0.00
Separated 17846 1520 16326 0.07 0.06 -0.02 0.00
Single / not married 39709 4071 35638 0.18 0.14 -0.23 0.01
Widow 14985 896 14089 0.04 0.06 0.36 0.01
Summary for NAME_HOUSING_TYPE:
#total #bad #good %bad %good woe iv
NAME_HOUSING_TYPE
Co-op apartment 994 83 911 0.00 0.00 -0.00 0.00
House / apartment 247388 19941 227447 0.86 0.89 0.04 0.00
Municipal apartment 10237 909 9328 0.04 0.04 -0.07 0.00
Office apartment 2341 159 2182 0.01 0.01 0.22 0.00
Rented apartment 4290 545 3745 0.02 0.01 -0.47 0.00
With parents 12981 1584 11397 0.07 0.04 -0.42 0.01
Summary for NAME_INCOME_TYPE:
#total #bad #good %bad %good woe iv
NAME_INCOME_TYPE
Commercial associate 63652 4990 58662 0.21 0.23 0.07 0.00
Maternity leave 2 2 0 0.00 0.00 -inf inf
Pensioner 51993 2842 49151 0.12 0.19 0.45 0.03
State servant 19836 1188 18648 0.05 0.07 0.36 0.01
Student 15 0 15 0.00 0.00 inf inf
Unemployed 15 8 7 0.00 0.00 -2.53 0.00
Working 142718 14191 128527 0.61 0.50 -0.19 0.02
Summary for NAME_TYPE_SUITE:
#total #bad #good %bad %good woe iv
NAME_TYPE_SUITE
Children 2965 226 2739 0.01 0.01 0.10 0.00
Family 36687 2820 33867 0.12 0.13 0.09 0.00
Group of people 243 21 222 0.00 0.00 -0.04 0.00
Other_A 766 74 692 0.00 0.00 -0.16 0.00
Other_B 1592 155 1437 0.01 0.01 -0.17 0.00
Spouse, partner 10431 851 9580 0.04 0.04 0.02 0.00
Unaccompanied 224540 19026 205514 0.82 0.81 -0.02 0.00
unknown 1007 48 959 0.00 0.00 0.60 0.00
Summary for OCCUPATION_TYPE:
#total #bad #good %bad %good woe iv
OCCUPATION_TYPE
Accountants 8607 434 8173 0.02 0.03 0.54 0.01
Cleaning staff 4235 409 3826 0.02 0.02 -0.16 0.00
Cooking staff 5437 577 4860 0.02 0.02 -0.27 0.00
Core staff 24374 1609 22765 0.07 0.09 0.25 0.01
Drivers 17248 2002 15246 0.09 0.06 -0.37 0.01
HR staff 478 33 445 0.00 0.00 0.21 0.00
High skill tech staff 10008 643 9365 0.03 0.04 0.28 0.00
IT staff 420 33 387 0.00 0.00 0.07 0.00
Laborers 50131 5489 44642 0.24 0.18 -0.30 0.02
Low-skill Laborers 1930 344 1586 0.01 0.01 -0.87 0.01
Managers 18739 1241 17498 0.05 0.07 0.25 0.00
Medicine staff 7740 530 7210 0.02 0.03 0.21 0.00
Private service staff 2359 158 2201 0.01 0.01 0.24 0.00
Realty agents 675 55 620 0.00 0.00 0.03 0.00
Sales staff 28534 2833 25701 0.12 0.10 -0.19 0.00
Secretaries 1169 84 1085 0.00 0.00 0.16 0.00
Security staff 6178 684 5494 0.03 0.02 -0.31 0.00
Waiters/barmen staff 1170 137 1033 0.01 0.00 -0.38 0.00
unknown 88799 5926 82873 0.26 0.32 0.24 0.02
Summary for ORGANIZATION_TYPE_TRANSF:
#total #bad #good %bad %good woe iv
ORGANIZATION_TYPE_TRANSF
Advertising 368 33 335 0.00 0.00 -0.08 0.00
Agriculture 2293 239 2054 0.01 0.01 -0.25 0.00
Business Entity 75363 7197 68166 0.31 0.27 -0.15 0.01
Construction 6084 734 5350 0.03 0.02 -0.41 0.00
Culture 330 17 313 0.00 0.00 0.52 0.00
Electricity 843 62 781 0.00 0.00 0.14 0.00
Emergency 505 37 468 0.00 0.00 0.14 0.00
Financial Service 2624 144 2480 0.01 0.01 0.45 0.00
Gov Officer 15697 1017 14680 0.04 0.06 0.27 0.00
Hospitality Industry 4125 372 3753 0.02 0.01 -0.08 0.00
Industry 12864 1151 11713 0.05 0.05 -0.08 0.00
Kindergarten 6273 462 5811 0.02 0.02 0.14 0.00
Legal Services 259 21 238 0.00 0.00 0.03 0.00
Medicine 10118 689 9429 0.03 0.04 0.22 0.00
Other 15014 1196 13818 0.05 0.05 0.05 0.00
Real Estate 3048 258 2790 0.01 0.01 -0.02 0.00
Religion 79 5 74 0.00 0.00 0.30 0.00
School 8067 493 7574 0.02 0.03 0.34 0.00
Security 2955 306 2649 0.01 0.01 -0.24 0.00
Self-employed 34518 3622 30896 0.16 0.12 -0.25 0.01
Telcomunicate 2776 232 2544 0.01 0.01 -0.00 0.00
Trade 12659 1197 11462 0.05 0.04 -0.14 0.00
Transport 8187 825 7362 0.04 0.03 -0.21 0.00
University 1174 62 1112 0.00 0.00 0.49 0.00
XNA 52008 2850 49158 0.12 0.19 0.45 0.03
Summary for OWN_CAR_AGE_BIN:
#total #bad #good %bad %good woe iv
OWN_CAR_AGE_BIN
[-inf, 2.0) 190182 16545 173637 0.71 0.68 -0.05 0.00
[2.0, 8.0) 31609 1823 29786 0.08 0.12 0.40 0.02
[8.0, 14.0) 26158 2037 24121 0.09 0.09 0.08 0.00
[14.0, inf) 30282 2816 27466 0.12 0.11 -0.12 0.00
Summary for REGION_POPULATION_RELATIVE_BIN:
#total #bad #good %bad %good woe iv
REGION_POPULATION_RELATIVE_BIN
[-inf, 0.00863) 54170 4968 49202 0.22 0.20 -0.09 0.00
[0.00863, 0.0166) 54917 4501 50416 0.20 0.20 0.03 0.00
[0.0166, 0.0207) 53752 5174 48578 0.23 0.20 -0.14 0.00
[0.0207, 0.0308) 54416 4544 49872 0.20 0.20 0.01 0.00
[0.0308, 0.0725) 53619 3714 49905 0.16 0.20 0.22 0.01
Summary for REGION_RATING_CLIENT:
#total #bad #good %bad %good woe iv
REGION_RATING_CLIENT
1 28408 1451 26957 0.06 0.11 0.53 0.02
2 205826 16711 189115 0.72 0.74 0.03 0.00
3 43997 5059 38938 0.22 0.15 -0.36 0.02
Summary for REGION_RATING_CLIENT_W_CITY:
#total #bad #good %bad %good woe iv
REGION_RATING_CLIENT_W_CITY
1 30190 1545 28645 0.07 0.11 0.52 0.02
2 208032 16963 191069 0.73 0.75 0.03 0.00
3 40009 4713 35296 0.20 0.14 -0.38 0.02
Summary for REG_CITY_NOT_LIVE_CITY:
#total #bad #good %bad %good woe iv
REG_CITY_NOT_LIVE_CITY
0 256810 20498 236312 0.88 0.93 0.05 0.00
1 21421 2723 18698 0.12 0.07 -0.47 0.02
Summary for REG_CITY_NOT_WORK_CITY:
#total #bad #good %bad %good woe iv
REG_CITY_NOT_WORK_CITY
0 214282 16185 198097 0.70 0.78 0.11 0.01
1 63949 7036 56913 0.30 0.22 -0.31 0.02
Summary for REG_REGION_NOT_LIVE_REGION:
#total #bad #good %bad %good woe iv
REG_REGION_NOT_LIVE_REGION
0 274222 22827 251395 0.98 0.99 0.00 0.00
1 4009 394 3615 0.02 0.01 -0.18 0.00
Summary for REG_REGION_NOT_WORK_REGION:
#total #bad #good %bad %good woe iv
REG_REGION_NOT_WORK_REGION
0 264418 21927 242491 0.94 0.95 0.01 0.00
1 13813 1294 12519 0.06 0.05 -0.13 0.00
Summary for WALLSMATERIAL_MODE:
#total #bad #good %bad %good woe iv
WALLSMATERIAL_MODE
Block 8266 613 7653 0.03 0.03 0.13 0.00
Mixed 2048 162 1886 0.01 0.01 0.06 0.00
Monolithic 1537 80 1457 0.00 0.01 0.51 0.00
Others 1460 123 1337 0.01 0.01 -0.01 0.00
Panel 59339 3918 55421 0.17 0.22 0.25 0.01
Stone, brick 58088 4454 53634 0.19 0.21 0.09 0.00
Wooden 4851 489 4362 0.02 0.02 -0.21 0.00
unknown 142642 13382 129260 0.58 0.51 -0.13 0.01
Summary for WEEKDAY_APPR_PROCESS_START:
#total #bad #good %bad %good woe iv
WEEKDAY_APPR_PROCESS_START
FRIDAY 45382 3835 41547 0.17 0.16 -0.01 0.00
MONDAY 45954 3675 42279 0.16 0.17 0.05 0.00
SATURDAY 30495 2475 28020 0.11 0.11 0.03 0.00
SUNDAY 14440 1200 13240 0.05 0.05 0.00 0.00
THURSDAY 45815 3841 41974 0.17 0.16 -0.00 0.00
TUESDAY 49110 4250 44860 0.18 0.18 -0.04 0.00
WEDNESDAY 47035 3945 43090 0.17 0.17 -0.01 0.00
we observe some features has IV = inf, let's inspect in detail:
inf_iv = []
for i, v in iv_values.items():
if v == float('inf'):
inf_iv.append(i)
for i in inf_iv:
print(pd.DataFrame(summaries[i]))
#total #bad #good %bad %good woe iv
AMT_REQ_CREDIT_BUREAU_DAY
-1.00 37238 3985 33253 0.17 0.13 -0.27 0.01
0.00 239620 19102 220518 0.82 0.86 0.05 0.00
1.00 1189 118 1071 0.01 0.00 -0.19 0.00
2.00 94 11 83 0.00 0.00 -0.38 0.00
3.00 45 2 43 0.00 0.00 0.67 0.00
4.00 25 3 22 0.00 0.00 -0.40 0.00
5.00 9 0 9 0.00 0.00 inf inf
6.00 8 0 8 0.00 0.00 inf inf
8.00 1 0 1 0.00 0.00 inf inf
9.00 2 0 2 0.00 0.00 inf inf
#total #bad #good %bad %good woe iv
AMT_REQ_CREDIT_BUREAU_HOUR
-1.00 37238 3985 33253 0.17 0.13 -0.27 0.01
0.00 239523 19114 220409 0.82 0.86 0.05 0.00
1.00 1411 116 1295 0.00 0.01 0.02 0.00
2.00 49 6 43 0.00 0.00 -0.43 0.00
3.00 9 0 9 0.00 0.00 inf inf
4.00 1 0 1 0.00 0.00 inf inf
#total #bad #good %bad %good woe iv
AMT_REQ_CREDIT_BUREAU_QRT
-1.00 37238 3985 33253 0.17 0.13 -0.27 0.01
0.00 194166 15808 178358 0.68 0.70 0.03 0.00
1.00 31452 2108 29344 0.09 0.12 0.24 0.01
2.00 13215 1136 12079 0.05 0.05 -0.03 0.00
3.00 1615 119 1496 0.01 0.01 0.14 0.00
4.00 446 52 394 0.00 0.00 -0.37 0.00
5.00 60 6 54 0.00 0.00 -0.20 0.00
6.00 25 6 19 0.00 0.00 -1.24 0.00
7.00 6 0 6 0.00 0.00 inf inf
8.00 6 0 6 0.00 0.00 inf inf
19.00 1 1 0 0.00 0.00 -inf inf
261.00 1 0 1 0.00 0.00 inf inf
#total #bad #good %bad %good woe iv
AMT_REQ_CREDIT_BUREAU_WEEK
-1.00 37238 3985 33253 0.17 0.13 -0.27 0.01
0.00 233047 18599 214448 0.80 0.84 0.05 0.00
1.00 7636 609 7027 0.03 0.03 0.05 0.00
2.00 186 19 167 0.00 0.00 -0.22 0.00
3.00 54 3 51 0.00 0.00 0.44 0.00
4.00 34 4 30 0.00 0.00 -0.38 0.00
5.00 10 1 9 0.00 0.00 -0.20 0.00
6.00 19 1 18 0.00 0.00 0.49 0.00
7.00 2 0 2 0.00 0.00 inf inf
8.00 5 0 5 0.00 0.00 inf inf
#total #bad #good %bad %good woe iv
CNT_CHILDREN
0 196452 15596 180856 0.67 0.71 0.05 0.00
1 54054 5048 49006 0.22 0.19 -0.12 0.00
2 23893 2177 21716 0.09 0.09 -0.10 0.00
3 3331 336 2995 0.01 0.01 -0.21 0.00
4 382 48 334 0.00 0.00 -0.46 0.00
5 79 7 72 0.00 0.00 -0.07 0.00
6 20 6 14 0.00 0.00 -1.55 0.00
7 7 0 7 0.00 0.00 inf inf
8 2 0 2 0.00 0.00 inf inf
9 2 2 0 0.00 0.00 -inf inf
10 2 0 2 0.00 0.00 inf inf
11 1 1 0 0.00 0.00 -inf inf
12 1 0 1 0.00 0.00 inf inf
14 3 0 3 0.00 0.00 inf inf
19 2 0 2 0.00 0.00 inf inf
#total #bad #good %bad %good woe iv
CNT_FAM_MEMBERS
1.00 60832 5261 55571 0.23 0.22 -0.04 0.00
2.00 145017 11320 133697 0.49 0.52 0.07 0.00
3.00 46699 4270 42429 0.18 0.17 -0.10 0.00
4.00 22086 2001 20085 0.09 0.08 -0.09 0.00
5.00 3118 306 2812 0.01 0.01 -0.18 0.00
6.00 365 48 317 0.00 0.00 -0.51 0.00
7.00 76 6 70 0.00 0.00 0.06 0.00
8.00 19 6 13 0.00 0.00 -1.62 0.00
9.00 6 0 6 0.00 0.00 inf inf
10.00 3 1 2 0.00 0.00 -1.70 0.00
11.00 1 1 0 0.00 0.00 -inf inf
12.00 2 0 2 0.00 0.00 inf inf
13.00 1 1 0 0.00 0.00 -inf inf
14.00 1 0 1 0.00 0.00 inf inf
15.00 1 0 1 0.00 0.00 inf inf
16.00 2 0 2 0.00 0.00 inf inf
20.00 2 0 2 0.00 0.00 inf inf
#total #bad #good %bad %good woe iv
DEF_30_CNT_SOCIAL_CIRCLE
-1.00 1 0 1 0.00 0.00 inf inf
0.00 246054 19775 226279 0.85 0.89 0.04 0.00
1.00 25885 2686 23199 0.12 0.09 -0.24 0.01
2.00 4892 577 4315 0.02 0.02 -0.38 0.00
3.00 1101 142 959 0.01 0.00 -0.49 0.00
4.00 232 32 200 0.00 0.00 -0.56 0.00
5.00 54 8 46 0.00 0.00 -0.65 0.00
6.00 10 1 9 0.00 0.00 -0.20 0.00
7.00 1 0 1 0.00 0.00 inf inf
8.00 1 0 1 0.00 0.00 inf inf
#total #bad #good %bad %good woe iv
DEF_60_CNT_SOCIAL_CIRCLE
-1.00 1 0 1 0.00 0.00 inf inf
0.00 254692 20623 234069 0.89 0.92 0.03 0.00
1.00 19908 2131 17777 0.09 0.07 -0.27 0.01
2.00 2924 357 2567 0.02 0.01 -0.42 0.00
3.00 561 93 468 0.00 0.00 -0.78 0.00
4.00 122 14 108 0.00 0.00 -0.35 0.00
5.00 19 3 16 0.00 0.00 -0.72 0.00
6.00 3 0 3 0.00 0.00 inf inf
7.00 1 0 1 0.00 0.00 inf inf
#total #bad #good %bad %good woe iv
NAME_INCOME_TYPE
Commercial associate 63652 4990 58662 0.21 0.23 0.07 0.00
Maternity leave 2 2 0 0.00 0.00 -inf inf
Pensioner 51993 2842 49151 0.12 0.19 0.45 0.03
State servant 19836 1188 18648 0.05 0.07 0.36 0.01
Student 15 0 15 0.00 0.00 inf inf
Unemployed 15 8 7 0.00 0.00 -2.53 0.00
Working 142718 14191 128527 0.61 0.50 -0.19 0.02
In this scope of project, since there are too many variables can impact to model robust, threshold for IV value will be set at 0.05 to only allow fair predictive features. For feature has IV < 0.05 will be remove from final_features before fitting data to model. From above result, we can address the inf value caused from small group contain 0 data of good/bad record. Statistically, we can assume distribution of good/bad in these cases with small value 0.00000001. By that, selected feature based on correcting IV to be adding NAME_INCOME_TYPE with IV 0.06
#selecting feature has IV >= 0.1:
final_feature = []
for feature, iv in iv_values.items():
if iv >= 0.01 and iv != float('inf'):
final_feature.append(feature)
#update inspecting feature has IV = inf as above analysis:
final_feature.append('NAME_INCOME_TYPE')
#adding target
final_feature.append('TARGET')
df_filter = df[final_feature]
print(len(df_filter.columns))
32
now with this final set of 31 features, let's proceed to model fitting and selection
In this project, for classification problem, Logistic Regression will be used. Dataset now will be transformed to fit in model.
X = pd.get_dummies(df_filter.drop('TARGET',axis = 1),drop_first=True)
y = df_filter['TARGET']
To assessing model prediction, dataset will be splitted into 3 parts: 50% training, 25% test and 25% validation. With 31 features in various scale, we will also use StandardScaler to standardize the scale of feature character.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.5, random_state=42) #train plit
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42) #test x validation split
X_train_scaled = StandardScaler().fit_transform(X_train)
X_test_scaled = StandardScaler().fit_transform(X_test)
X_val_scaled = StandardScaler().fit_transform(X_val)
model = LogisticRegression(random_state = 42, max_iter=1000) # Instantiate the model with balanced class_weight
model.fit(X_train_scaled,y_train) # Fit the model using training set
#have a look on current params
model.get_params()
{'C': 1.0,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'l1_ratio': None,
'max_iter': 1000,
'multi_class': 'auto',
'n_jobs': None,
'penalty': 'l2',
'random_state': 42,
'solver': 'lbfgs',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
#TEST SET
y_pred_t = model.predict(X_test_scaled)
y_pred_probs_t = model.predict_proba(X_test_scaled)[:, 1]
auc_roc_t = roc_auc_score(y_test, y_pred_probs_t)
CX_test = confusion_matrix(y_test, y_pred_t)
#VAL SET
y_pred_v = model.predict(X_val_scaled)
y_pred_probs_v = model.predict_proba(X_val_scaled)[:, 1]
auc_roc_v = roc_auc_score(y_val, y_pred_probs_v)
CX_val = confusion_matrix(y_val, y_pred_v)
#fpr_tr, tpr_tr, _ = roc_curve(y_train, y_pred_probs_tr)
fpr_t, tpr_t, _ = roc_curve(y_test, y_pred_probs_t)
fpr_v, tpr_v, _ = roc_curve(y_val, y_pred_probs_v)
#plt.plot(fpr_tr,tpr_tr, label='AUC train_set:'+str(round(auc_roc_tr,5)))
plt.plot(fpr_t,tpr_t, label='AUC test_set:'+str(round(auc_roc_t,5)))
plt.plot(fpr_v,tpr_v, label='AUC val_set:'+str(round(auc_roc_v,5)))
plt.axline((0,0), slope=1,fillstyle='full', c='green', linestyle='-.', linewidth=0.5)
plt.title('AUC ROC Curve')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend()
plt.show()
We can see the model is fitting to fair AUC score above 0.7 with describe method above. This score is above the targeting benchmark of 0.68. However, let's also try to look on tuning param by GridSearchCV with standard fold = 5, to see if we can finetuning to any better performance.
In this step, we will tune param using GridSearchCV, target to achieve best model
param_grid = {
'C': [0.001, 0.01, 1], # Regularization parameter
'solver': ['lbfgs','newton-cg', 'newton-cholesky', 'liblinear'] # Solver type
}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)
print("Best parameters:", grid_search.best_params_)
# Evaluate the model with best parameters on test data
best_model = grid_search.best_estimator_
y_pred_probs_b = best_model.predict_proba(X_test_scaled)[:, 1]
auc_roc_b = roc_auc_score(y_test, y_pred_probs_b)
print("AUC:", auc_roc_b)
Best parameters: {'C': 0.01, 'solver': 'liblinear'}
AUC: 0.7348482882976024
best_model.get_params()
{'C': 0.01,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'l1_ratio': None,
'max_iter': 1000,
'multi_class': 'auto',
'n_jobs': None,
'penalty': 'l2',
'random_state': 42,
'solver': 'liblinear',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
#evaluation on best_param
#TEST SET
y_pred_t = best_model.predict(X_test_scaled)
y_pred_probs_t = best_model.predict_proba(X_test_scaled)[:, 1]
auc_roc_t = roc_auc_score(y_test, y_pred_probs_t)
CX_test = confusion_matrix(y_test, y_pred_t)
#VAL SET
y_pred_v = best_model.predict(X_val_scaled)
y_pred_probs_v = best_model.predict_proba(X_val_scaled)[:, 1]
auc_roc_v = roc_auc_score(y_val, y_pred_probs_v)
CX_val = confusion_matrix(y_val, y_pred_v)
#fpr_tr, tpr_tr, _ = roc_curve(y_train, y_pred_probs_tr)
fpr_t, tpr_t, _ = roc_curve(y_test, y_pred_probs_t)
fpr_v, tpr_v, _ = roc_curve(y_val, y_pred_probs_v)
#plt.plot(fpr_tr,tpr_tr, label='AUC train_set:'+str(round(auc_roc_tr,5)))
plt.plot(fpr_t,tpr_t, label='AUC test_set:'+str(round(auc_roc_t,5)))
plt.plot(fpr_v,tpr_v, label='AUC val_set:'+str(round(auc_roc_v,5)))
plt.axline((0,0), slope=1,fillstyle='full', c='green', linestyle='-.', linewidth=0.5)
plt.title('AUC ROC Curve')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend()
plt.show()
After param tuning, we observe no significant different between the base model using default value vs. tuning model using best_param. Therefore, we can interchangeably these 2 models.
This result likely happened when we did select proper set of feature through intensive analysis, careful feature engineering & feature scaling earlier, so that it contributes to learning progress initially, therefore the tuning param can help little on improving performance in Logistic Regression in binary class prediction.
To demonstrate the next analysis on predictive result using confusion matrix, as showing slightly improve comparing to base model, evaluation on result of best_model will be used.
cm_clf1 = CX_test
cm_clf2 = CX_val
# Define class labels
classes = ['good', 'bad']
# Plotting
fig, axs = plt.subplots(1, 2, figsize=(12, 6)) # 1 row, 2 columns
# Plot confusion matrix for classifier 1
axs[0].imshow(cm_clf1, cmap='Blues', interpolation='nearest')
axs[0].set(xticks=np.arange(cm_clf1.shape[1]),
yticks=np.arange(cm_clf1.shape[0]),
xticklabels=classes, yticklabels=classes,
title='Confusion Matrix - test set',
xlabel='Predicted label',
ylabel='True label')
plt.setp(axs[0].get_xticklabels(), rotation=45, ha="right",
rotation_mode="anchor")
for i in range(cm_clf1.shape[0]):
for j in range(cm_clf1.shape[1]):
axs[0].text(j, i, format(cm_clf1[i, j], 'd'),
ha="center", va="center",
color="white" if cm_clf1[i, j] > cm_clf1.max() / 2. else "black")
# Plot confusion matrix for classifier 2
axs[1].imshow(cm_clf2, cmap='Greens', interpolation='nearest')
axs[1].set(xticks=np.arange(cm_clf2.shape[1]),
yticks=np.arange(cm_clf2.shape[0]),
xticklabels=classes, yticklabels=classes,
title='Confusion Matrix - Validation set',
xlabel='Predicted label',
ylabel='True label')
plt.setp(axs[1].get_xticklabels(), rotation=45, ha="right",
rotation_mode="anchor")
for i in range(cm_clf2.shape[0]):
for j in range(cm_clf2.shape[1]):
axs[1].text(j, i, format(cm_clf2[i, j], 'd'),
ha="center", va="center",
color="white" if cm_clf2[i, j] > cm_clf2.max() / 2. else "black")
plt.tight_layout()
plt.show()
let's have a quick analysis on test set & validation set to see how the model
1. TEST SET:
- original default rate = true_bad / total sample = (5693 + 55) / 69558 = 0.08264
- default rate by best_model = (true_bad|predicted_good) / predict_good = 5693 / (5693 + 63748) = 0.08198
2. VALIDATION SET:
- original default rate = true_bad / total sample = (5622 + 51) / 69558 = 0.08156
- default rate by best_model = (true_bad|predicted_good) / predict_good = 5622 / (5622 + 63819) = 0.08096
Above result proved that using model can help to improve default rate ~0.06%.
CONCLUSION
Key take away from this project:
1. Data cleaning, method of feature selection and feature engineering play important role in build a good model.
<br>
2. Using model prediction can help to reduce ~0.06% default rate in this project scope.
<br>
3. Recommend to also considering the increase in rejection rate when using model to predict, compare potential loss of interest rate from rejected incorrect predicted bad customer vs. saving loss from correct predicted bad customer, to decide model release to production.
</font>